Preface


This book is an introductory textbook in undergraduate probability. It has a mission: to spell 
out the motivation, intuition, and implication of the probabilistic tools we use in science 
and engineering. From over half a decade of teaching the course, I have distilled what I 
believe to be the core of probabilistic methods. I put the book in the context of data science 
to emphasize the inseparability between data (computing) and probability (theory) in our 
time. 

Probability is one of the most interesting subjects in electrical engineering and computer
science. It bridges our favorite engineering principles to the practical reality, a world
that is full of uncertainty. However, because probability is such a mature subject, the
undergraduate textbooks alone might fill several rows of shelves in a library. When the literature
is so rich, the challenge becomes how one can pierce through to the insight while diving into
the details. For example, many of you have used a normal random variable before, but have
you ever wondered where the “bell shape” comes from? Every probability class will teach
you about flipping a coin, but how can “flipping a coin” ever be useful in machine learning
today? Data scientists use Poisson random variables to model internet traffic, but
where does the gorgeous Poisson equation come from? This book is designed to fill these
gaps with knowledge that is essential to all data science students.

This leads to the three goals of the book. (i) Motivation: In the ocean of mathematical 
definitions, theorems, and equations, why should we spend our time on this particular topic 
but not another? (ii) Intuition: When going through the derivations, is there a geometric 
interpretation or physics beyond those equations? (iii) Implication: After we have learned a 
topic, what new problems can we solve? 

The book’s intended audience is undergraduate juniors/seniors and first-year graduate
students majoring in electrical engineering and computer science. The prerequisites are
standard undergraduate linear algebra and calculus, except for the section about
characteristic functions, where Fourier transforms are needed. An undergraduate course in signals
and systems would suffice, even taken concurrently while studying this book.

The length of the book is suitable for a two-semester course. Instructors are encouraged 
to use the set of chapters that best fits their classes. For example, a basic probability course 
can use Chapters 1-5 as its backbone. Chapter 6 on sample statistics is suitable for students 
who wish to gain theoretical insights into probabilistic convergence. Chapter 7 on regression 
and Chapter 8 on estimation best suit students who want to pursue machine learning and 
signal processing. Chapter 9 discusses confidence intervals and hypothesis testing, which are 
critical to modern data analysis. Chapter 10 introduces random processes. My approach for 
random processes is more tailored to information processing and communication systems, 
which are usually more relevant to electrical engineering students. 

Additional teaching resources can be found on the book’s website, where you can
find lecture videos and homework videos. Throughout the book you will see many “practice
exercises”, which are easy problems with worked-out solutions. They can be skipped without
loss to the flow of the book.

Chapter 1


Mathematical Background 


“Data science” has different meanings to different people. If you ask a biologist, data science
could mean analyzing DNA sequences. If you ask a banker, data science could mean
predicting the stock market. If you ask a software engineer, data science could mean programs
and data structures; if you ask a machine learning scientist, data science could mean models
and algorithms. However, one thing that is common in all these disciplines is the concept of
uncertainty. We choose to learn from data because we believe that the latent information
is embedded in the data, which is unprocessed, contains noise, and could have missing entries. If
there is no randomness, all data scientists can close their business because there is simply
no problem to solve. However, the moment we see randomness, our business comes back.
Therefore, data science is the subject of making decisions under uncertainty.

The mathematics of analyzing uncertainty is probability. It is the tool to help us model, 
analyze, and predict random events. Probability can be studied in as many ways as you can 
think of. You can take a rigorous course in probability theory, or a “probability for dummies” 
on the internet, or a typical undergraduate probability course offered by your school. This 
book is different from all these. Our goal is to tell you how things work in the context of data 
science. For example, why do we need those three axioms of probabilities and not others? 
Where does the “bell shape” Gaussian random variable come from? How many samples do 
we need to construct a reliable histogram? These questions are at the core of data science,
and they deserve close attention rather than being swept under the rug.

To help you get used to the pace and style of this book, in this chapter, we review some 
of the very familiar topics in undergraduate algebra and calculus. These topics are meant 
to warm up your mathematics background so that you can follow the subsequent chapters. 
Specifically, in this chapter, we cover several topics. First, in Section 1.1 we discuss infinite 
series, something that will be used frequently when we evaluate the expectation and variance 
of random variables in Chapter 3. In Section 1.2 we review the Taylor approximation, 
which will be helpful when we discuss continuous random variables. Section 1.3 discusses 
integration and reviews several tricks we can use to make integration easy. Section 1.4 
deals with linear algebra, aka matrices and vectors, which are fundamental to modern data 
analysis. Finally, Section 1.5 discusses permutation and combination, two basic techniques 
to count events. 




1.1 Infinite Series 


Imagine that you have a fair coin. If you get a tail, you flip it again. You do this repeatedly 
until you finally get a head. What is the probability that you need to flip the coin three 
times to get one head? 


This is a warm-up exercise. Since the coin is fair, the probability of obtaining a head
is $\frac{1}{2}$. The probability of getting a tail followed by a head is
$\frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$. Similarly, the probability of getting two tails
and then a head is $\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}$. If you follow this logic, you
can write down the probabilities for all other cases. For your convenience, we have drawn the
first few in Figure 1.1. As you have probably noticed, the probabilities follow the pattern
$\left\{\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots\right\}$.


Figure 1.1: Suppose you flip a coin until you see a head. This requires you to have N − 1 tails followed
by a head. The probabilities of this sequence of events are $\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots$, which form an infinite sequence.


We can also summarize these probabilities using a familiar plot called the histogram,
as shown in Figure 1.2. The histogram for this problem has a special pattern: every
value is one half of the preceding one (the power of $\frac{1}{2}$ increases by one each time), and the
sequence is infinitely long.


Figure 1.2: The histogram of flipping a coin until we see a head. The x-axis is the number of coin flips,
and the y-axis is the probability.


Let us ask something harder: On average, if you want to be 90% sure that you will 
get a head, what is the minimum number of attempts you need to try? Five attempts? 
Ten attempts? Indeed, if you try ten attempts, you will very likely accomplish your goal. 
However, this would seem to be overkill. If you try five attempts, then it becomes unclear 
whether you will be 90% sure. 




This problem can be answered by analyzing the sequence of probabilities. If we make 
two attempts, then the probability of getting a head is the sum of the probabilities for one 
attempt and that of two attempts: 


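$$
\begin{aligned}
\mathbb{P}[\text{success after 1 attempt}] &= \frac{1}{2} = 0.5 \\
\mathbb{P}[\text{success after 2 attempts}] &= \frac{1}{2} + \frac{1}{4} = 0.75
\end{aligned}
$$
Therefore, if you make 3 attempts or 4 attempts, you get the following probabilities:
$$
\begin{aligned}
\mathbb{P}[\text{success after 3 attempts}] &= \frac{1}{2} + \frac{1}{4} + \frac{1}{8} = 0.875 \\
\mathbb{P}[\text{success after 4 attempts}] &= \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = 0.9375
\end{aligned}
$$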


So if we try four attempts, we will have a 93.75% probability of getting a head. Thus, four 
attempts is the answer. 
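
The running sum above can also be accumulated by a short program. The following is a minimal sketch; the function name `min_attempts` and its arguments are our own illustrative choices, not something defined elsewhere in the book.

```python
def min_attempts(p_head=0.5, target=0.9):
    """Smallest n such that P[at least one head within n flips] >= target."""
    cumulative = 0.0
    n = 0
    while cumulative < target:
        n += 1
        # Probability that the first head appears exactly on flip n:
        # n - 1 tails followed by one head.
        cumulative += (1 - p_head) ** (n - 1) * p_head
    return n, cumulative


n, prob = min_attempts()
print(n, prob)
```

Changing `target` lets you repeat the same reasoning for other confidence levels without redoing the sum by hand.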
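The MATLAB / Python codes we used to generate Figure 1.2 are shown below.


% MATLAB code to generate a geometric sequence
p = 1/2;                                % probability of a head
n = 1:10;                               % number of coin flips
X = p.^n;                               % probability that the first head is on flip n
bar(n,X,'FaceColor',[0.8, 0.2,0.2]);


# Python code to generate a geometric sequence
import numpy as np
import matplotlib.pyplot as plt
p = 1/2                                 # probability of a head
n = np.arange(1,11)                     # number of coin flips, 1 to 10
X = np.power(p,n)                       # probability that the first head is on flip n
plt.bar(n,X)
plt.show()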


This warm-up exercise has perhaps raised some of your interest in the subject. However, 
we will not tell you everything now. We will come back to the probability in Chapter 3 
when we discuss geometric random variables. In the present section, we want to make sure 
you have the basic mathematical tools to calculate quantities, such as a sum of fractional 
numbers. For example, what if we want to calculate $\mathbb{P}[\text{success after } 10^7 \text{ attempts}]$? Is there
a systematic way of performing the calculation? 


Remark. You should be aware that the 93.75% only says that the probability of achieving 
the goal is high. If you have a bad day, you may still need more than four attempts. Therefore, 
when we stated the question, we asked for 90% “on average”. Sometimes you may need 
more attempts and sometimes fewer attempts, but on average, you have a 93.75% chance 
of succeeding. 


1.1.1 Geometric Series 


A geometric series is the sum of a finite or an infinite sequence of numbers with a constant 
ratio between successive terms. As we have seen in the previous example, a geometric series
appears naturally in the context of discrete events. In Chapter 3 of this book, we will use
geometric series when calculating the expectation and moments of a random variable. 


Definition 1.1. Let 0 < r < 1. A finite geometric sequence of power n is a sequence
of numbers
$$
\left\{1, r, r^2, \ldots, r^n\right\}.
$$
An infinite geometric sequence is a sequence of numbers
$$
\left\{1, r, r^2, r^3, \ldots\right\}.
$$


Theorem 1.1. The sum of a finite geometric series of power n is
$$
\sum_{k=0}^{n} r^k = 1 + r + r^2 + \cdots + r^n = \frac{1 - r^{n+1}}{1 - r}. \tag{1.1}
$$


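Proof. We multiply both sides by 1 − r. The left hand side becomes
$$
\left(\sum_{k=0}^{n} r^k\right)(1 - r) = \left(1 + r + r^2 + \cdots + r^n\right)(1 - r)
= \left(1 + r + r^2 + \cdots + r^n\right) - \left(r + r^2 + r^3 + \cdots + r^{n+1}\right)
\overset{(a)}{=} 1 - r^{n+1},
$$
where (a) holds because terms are canceled due to subtractions.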


A corollary of Equation (1.1) is the sum of an infinite geometric sequence. 


Corollary 1.1. Let 0 < r < 1. The sum of an infinite geometric series is
$$
\sum_{k=0}^{\infty} r^k = 1 + r + r^2 + \cdots = \frac{1}{1 - r}. \tag{1.2}
$$


Remark. Note that the condition 0 < r < 1 is important. If r > 1, then the limit
$\lim_{n \to \infty} r^{n+1}$ in Equation (1.2) will diverge. The constant r cannot equal 1, for
otherwise the fraction $(1 - r^{n+1})/(1 - r)$ is undefined. We are not interested in the case when
r = 0, because the sum is trivially 1: $\sum_{k=0}^{\infty} 0^k = 1 + 0^1 + 0^2 + \cdots = 1$.
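
If you would like to see the two closed-form expressions at work, the following minimal sketch compares direct summation against the finite and infinite formulas. The helper name `geometric_partial_sum` and the choice r = 0.5 are ours, purely for illustration.

```python
def geometric_partial_sum(r, n):
    """Return 1 + r + r^2 + ... + r^n by direct summation."""
    return sum(r ** k for k in range(n + 1))


r = 0.5
closed_form_finite = (1 - r ** 11) / (1 - r)   # finite-sum formula with n = 10
closed_form_infinite = 1 / (1 - r)             # infinite-sum formula
partial_10 = geometric_partial_sum(r, 10)
partial_60 = geometric_partial_sum(r, 60)      # long enough to be numerically at the limit

print(partial_10, closed_form_finite)
print(partial_60, closed_form_infinite)
```

The partial sum with 60 terms is already indistinguishable from 1/(1 − r) in floating point, which is what the limit in the corollary predicts.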




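Practice Exercise 1.1. Compute the infinite series $\sum_{k=2}^{\infty} \frac{1}{2^k}$.

Solution.
$$
\sum_{k=2}^{\infty} \frac{1}{2^k} = \frac{1}{4} + \frac{1}{8} + \cdots
= \frac{1}{4}\left(1 + \frac{1}{2} + \frac{1}{4} + \cdots\right)
= \frac{1}{4} \cdot \frac{1}{1 - \frac{1}{2}} = \frac{1}{2}.
$$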


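Remark. You should not confuse a geometric series with a harmonic series. A
harmonic series concerns the sum of $\left\{1, \frac{1}{2}, \frac{1}{3}, \frac{1}{4}, \ldots\right\}$. It turns out that$^1$
$$
\sum_{n=1}^{\infty} \frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots = \infty.
$$
On the other hand, a squared harmonic series $\left\{1, \frac{1}{2^2}, \frac{1}{3^2}, \frac{1}{4^2}, \ldots\right\}$ converges:
$$
\sum_{n=1}^{\infty} \frac{1}{n^2} = 1 + \frac{1}{2^2} + \frac{1}{3^2} + \frac{1}{4^2} + \cdots = \frac{\pi^2}{6}.
$$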

The latter result is known as the Basel problem. 
We can extend the main theorem by considering more complicated series, for example 
the following one. 


Corollary 1.2. Let 0 < r < 1. It holds that
$$
\sum_{k=1}^{\infty} k r^{k-1} = 1 + 2r + 3r^2 + \cdots = \frac{1}{(1 - r)^2}.
$$


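Proof. Take the derivative on both sides of Equation (1.2). The left hand side becomes
$$
\frac{d}{dr}\left(1 + r + r^2 + \cdots\right) = 1 + 2r + 3r^2 + \cdots = \sum_{k=1}^{\infty} k r^{k-1}.
$$
The right hand side becomes
$$
\frac{d}{dr}\left(\frac{1}{1 - r}\right) = \frac{1}{(1 - r)^2}.
$$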
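
A quick numerical check of the derivative result can be reassuring. The sketch below truncates the series at 200 terms, which is an arbitrary choice of ours; the helper name `derivative_series` is likewise illustrative.

```python
def derivative_series(r, terms=200):
    """Truncated version of 1 + 2r + 3r^2 + ... using the first `terms` terms."""
    return sum(k * r ** (k - 1) for k in range(1, terms + 1))


r = 1 / 3
lhs = derivative_series(r)      # truncated series
rhs = 1 / (1 - r) ** 2          # closed form 1 / (1 - r)^2
weighted = r * lhs              # multiplying by r gives the sum of k * r^k

print(lhs, rhs, weighted)
```

Multiplying the series by r turns it into the sum of k r^k, a form that appears often when computing expectations.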


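$^1$This result can be found in Tom Apostol, Mathematical Analysis, 2nd Edition, Theorem 8.11.

Practice Exercise 1.2. Compute the infinite sum $\sum_{k=1}^{\infty} k \cdot \frac{1}{3^k}$.


Solution. We can use the derivative result:
$$
\sum_{k=1}^{\infty} k \cdot \frac{1}{3^k}
= \frac{1}{3} \sum_{k=1}^{\infty} k \left(\frac{1}{3}\right)^{k-1}
= \frac{1}{3} \cdot \frac{1}{\left(1 - \frac{1}{3}\right)^2}
= \frac{1}{3} \cdot \frac{9}{4} = \frac{3}{4}.
$$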


1.1.2 Binomial Series 


A geometric series is useful when handling situations such as N — 1 failures followed by 
a success. However, we can easily twist the problem by asking: What is the probability 
of getting one head out of 3 independent coin tosses? In this case, the probability can be 
determined by enumerating all possible cases: 


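$$
\begin{aligned}
\mathbb{P}[\text{1 head in 3 coins}] &= \mathbb{P}[H,T,T] + \mathbb{P}[T,H,T] + \mathbb{P}[T,T,H] \\
&= \left(\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}\right) + \left(\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}\right) + \left(\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}\right) \\
&= \frac{3}{8}.
\end{aligned}
$$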


Figure 1.3 illustrates the situation. 


Figure 1.3: When flipping three coins independently, the probability of getting exactly one head can 
come from three different possibilities. 
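
The enumeration can be carried out mechanically. Here is a minimal sketch using Python's standard `itertools.product`; the variable names are ours.

```python
from itertools import product

# All 2^3 = 8 equally likely outcomes of three independent fair coin tosses.
outcomes = list(product("HT", repeat=3))

# Keep only the outcomes that contain exactly one head.
favorable = [o for o in outcomes if o.count("H") == 1]

probability = len(favorable) / len(outcomes)
print(favorable)
print(probability)
```

Counting outcomes this way works only because every outcome of a fair coin is equally likely; for a biased coin each outcome would need its own weight.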


What lessons have we learned in this example? Notice that you need to enumerate 
all possible combinations of one head and two tails to solve this problem. The number is 
3 in our example. In general, the number of combinations can be systematically studied 
using combinatorics, which we will discuss later in the chapter. However, the number of 
combinations motivates us to discuss another background technique known as the binomial 
series. The binomial series is instrumental in algebra when handling polynomials such as 
$(a + b)^2$ or $(1 + x)^3$. It provides a valuable formula when computing these powers.


Theorem 1.2 (Binomial theorem). For any real numbers a and b, the binomial series
of power n is
$$
(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k, \tag{1.4}
$$
where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$.




The binomial theorem is valid for any real numbers a and b. The quantity $\binom{n}{k}$ reads
as “n choose k”. Its definition is
$$
\binom{n}{k} \overset{\text{def}}{=} \frac{n!}{k!(n-k)!},
$$
where $n! = n(n-1)(n-2) \cdots 3 \cdot 2 \cdot 1$. We shall discuss the physical meaning of $\binom{n}{k}$ in
Section 1.5. But we can quickly plug in the “n choose k” into the coin flipping example by
letting n = 3 and k = 1:
$$
\text{Number of combinations for 1 head and 2 tails} = \binom{3}{1} = \frac{3!}{1!\,2!} = 3.
$$


So you can see why we want you to spend your precious time learning about the binomial
theorem. In MATLAB and Python, $\binom{n}{k}$ can be computed using the commands as follows.


% MATLAB code to compute (N choose K) and K!
n = 10;
k = 2;
nchoosek(n,k)
factorial(k)


# Python code to compute (N choose K) and K!
from scipy.special import comb, factorial
n = 10
k = 2
comb(n, k)
factorial(k)


The binomial theorem makes the most sense when we also learn about Pascal’s
identity.


Theorem 1.3 (Pascal’s identity). Let n and k be positive integers such that k ≤ n.
Then,
$$
\binom{n}{k} + \binom{n}{k-1} = \binom{n+1}{k}. \tag{1.5}
$$


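Proof. We start by recalling the definition of $\binom{n}{k}$. This gives us
$$
\begin{aligned}
\binom{n}{k} + \binom{n}{k-1} &= \frac{n!}{k!(n-k)!} + \frac{n!}{(k-1)!(n-(k-1))!} \\
&= n!\left[\frac{1}{k!(n-k)!} + \frac{1}{(k-1)!(n-k+1)!}\right],
\end{aligned}
$$
where we factor out n! to obtain the second equation. Next, we observe that
$$
\begin{aligned}
\frac{1}{k!(n-k)!} \times \frac{(n-k+1)}{(n-k+1)} &= \frac{n-k+1}{k!(n-k+1)!}, \\
\frac{1}{(k-1)!(n-k+1)!} \times \frac{k}{k} &= \frac{k}{k!(n-k+1)!}.
\end{aligned}
$$
Substituting into the previous equation we obtain
$$
\begin{aligned}
\binom{n}{k} + \binom{n}{k-1} &= n!\left[\frac{n-k+1}{k!(n-k+1)!} + \frac{k}{k!(n-k+1)!}\right] \\
&= n!\left[\frac{n+1}{k!(n-k+1)!}\right] = \frac{(n+1)!}{k!(n+1-k)!} = \binom{n+1}{k}.
\end{aligned}
$$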


The Pascal triangle is a visualization of the coefficients of $(a + b)^n$ as shown in
Figure 1.4. For example, when n = 5, we know that $\binom{5}{3} = 10$. However, by Pascal’s identity, we
know that $\binom{5}{3} = \binom{4}{3} + \binom{4}{2}$. So the number 10 is actually obtained by summing the numbers
4 and 6 of the previous row.


Figure 1.4: Pascal triangle for n = 0,...,5. Note that a number in one row is obtained by summing
two numbers directly above it.
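
The triangle can be built row by row from the identity alone. The sketch below is one way to do it; `pascal_rows` is a name we made up, and `math.comb` (available in Python 3.8 and later) is used only to cross-check the entries.

```python
from math import comb


def pascal_rows(n_max):
    """Return rows 0..n_max of the Pascal triangle as lists of integers."""
    rows = [[1]]
    for _ in range(n_max):
        prev = rows[-1]
        # Each interior entry is the sum of the two entries directly above it.
        interior = [prev[i - 1] + prev[i] for i in range(1, len(prev))]
        rows.append([1] + interior + [1])
    return rows


rows = pascal_rows(5)
for row in rows:
    print(row)
```

Every entry `rows[n][k]` produced this way should agree with the factorial formula for “n choose k”, which is what the cross-check with `comb` confirms.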


Practice Exercise 1.3. Find $(1 + x)^3$.


Solution. Using the binomial theorem, we can show that
$$
(1 + x)^3 = \sum_{k=0}^{3} \binom{3}{k} 1^{3-k} x^k = 1 + 3x + 3x^2 + x^3.
$$


Practice Exercise 1.4. Let 0 < p < 1. Find
$$
\sum_{k=0}^{n} \binom{n}{k} p^{n-k} (1 - p)^k.
$$

Solution. By using the binomial theorem, we have
$$
\sum_{k=0}^{n} \binom{n}{k} p^{n-k} (1 - p)^k = (p + (1 - p))^n = 1.
$$



This result will be helpful when evaluating binomial random variables in Chapter 3. 
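
As a sanity check, the sum can be evaluated numerically for any particular n and p. The values n = 10 and p = 0.3 below are arbitrary choices of ours, and `binomial_sum` is an illustrative helper name.

```python
from math import comb


def binomial_sum(n, p):
    """Sum of C(n, k) * p^(n-k) * (1-p)^k over k = 0, ..., n."""
    return sum(comb(n, k) * p ** (n - k) * (1 - p) ** k for k in range(n + 1))


total = binomial_sum(10, 0.3)
print(total)
```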




We now prove the binomial theorem. Please feel free to skip the proof if this is your first
time reading the book.


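Proof of the binomial theorem. We prove by induction. When n = 1,
$$
(a + b)^1 = a + b = \sum_{k=0}^{1} \binom{1}{k} a^{1-k} b^k.
$$
Therefore, the base case is verified. Assume up to case n. We need to verify case n + 1.
$$
\begin{aligned}
(a + b)^{n+1} &= (a + b)(a + b)^n = (a + b) \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k \\
&= \sum_{k=0}^{n} \binom{n}{k} a^{n-k+1} b^k + \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^{k+1}.
\end{aligned}
$$
We want to apply Pascal’s identity to combine the two terms. In order to do so, we note
that the second term in this sum can be rewritten as
$$
\begin{aligned}
\sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^{k+1} &= \sum_{k=0}^{n} \binom{n}{k} a^{n+1-k-1} b^{k+1} \\
&= \sum_{\ell=1}^{n+1} \binom{n}{\ell-1} a^{n+1-\ell} b^{\ell}, \quad \text{where } \ell = k + 1 \\
&= \sum_{\ell=1}^{n} \binom{n}{\ell-1} a^{n+1-\ell} b^{\ell} + b^{n+1}.
\end{aligned}
$$
The first term in the sum can be written as
$$
\sum_{k=0}^{n} \binom{n}{k} a^{n-k+1} b^k = \sum_{\ell=1}^{n} \binom{n}{\ell} a^{n+1-\ell} b^{\ell} + a^{n+1}, \quad \text{where } \ell = k.
$$
Therefore, the two terms can be combined using Pascal’s identity to yield
$$
\begin{aligned}
(a + b)^{n+1} &= \sum_{\ell=1}^{n} \left[\binom{n}{\ell} + \binom{n}{\ell-1}\right] a^{n+1-\ell} b^{\ell} + a^{n+1} + b^{n+1} \\
&= \sum_{\ell=1}^{n} \binom{n+1}{\ell} a^{n+1-\ell} b^{\ell} + a^{n+1} + b^{n+1} = \sum_{\ell=0}^{n+1} \binom{n+1}{\ell} a^{n+1-\ell} b^{\ell}.
\end{aligned}
$$
Hence, the (n + 1)th case is also verified. By the principle of mathematical induction, we
have completed the proof.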


The end of the proof. Please join us again.


1.2 Approximation


Consider a function $f(x) = \log(1 + x)$, for x > 0 as shown in Figure 1.5. This is a nonlinear
function, and we all know that nonlinear functions are not fun to deal with. For example,
if you want to integrate the function $\int_a^b x \log(1 + x)\, dx$, then the logarithm will force you
to do integration by parts. However, in many practical problems, you may not need the full
range of x > 0. Suppose that you are only interested in values $x \ll 1$. Then the logarithm
can be approximated, and thus the integral can also be approximated.


Figure 1.5: The function $f(x) = \log(1 + x)$ and the approximation $\hat{f}(x) = x$.


To see how this is even possible, we show in Figure 1.5 the nonlinear function $f(x) =
\log(1 + x)$ and an approximation $\hat{f}(x) = x$. The approximation is carefully chosen such that
for $x \ll 1$, the approximation $\hat{f}(x)$ is close to the true function $f(x)$. Therefore, we can
argue that for $x \ll 1$,
$$
\log(1 + x) \approx x, \tag{1.6}
$$
thereby simplifying the calculation. For example, if you want to integrate $x \log(1 + x)$ for
0 < x < 0.1, then the integral can be approximated by
$$
\int_0^{0.1} x \log(1 + x)\, dx \approx \int_0^{0.1} x^2\, dx = \left.\frac{x^3}{3}\right|_0^{0.1} = 3.33 \times 10^{-4}.
$$
(The actual integral is $3.21 \times 10^{-4}$.) In this section we will learn about
the basic approximation techniques. We will use them when we discuss limit theorems in
Chapter 6, as well as various distributions, such as from binomial to Poisson.
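
The two numbers quoted for the integral over 0 < x < 0.1 can be reproduced with a few lines of code. This is a minimal sketch: the closed form used in `exact_integral` is our own integration-by-parts result, stated here as an assumption rather than taken from the text.

```python
import math


def exact_integral(b):
    """Integral of x*log(1+x) from 0 to b.

    Uses the antiderivative ((x^2 - 1)/2) * log(1+x) - x^2/4 + x/2,
    obtained by integration by parts (it vanishes at x = 0).
    """
    return (b * b - 1) / 2 * math.log(1 + b) - b * b / 4 + b / 2


def approx_integral(b):
    """Integral of x*x from 0 to b, i.e. with log(1+x) replaced by x."""
    return b ** 3 / 3


exact = exact_integral(0.1)
approx = approx_integral(0.1)
print(exact, approx)
```

The gap between the two values shrinks as the upper limit gets smaller, which is the sense in which the approximation is meant for small x.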


1.2.1 Taylor approximation


Given a function $f : \mathbb{R} \to \mathbb{R}$, it is often useful to analyze its behavior by approximating f
using its local information. Taylor approximation (or Taylor series) is one of the tools for 
such a task. We will use the Taylor approximation on many occasions. 


Definition 1.2 (Taylor Approximation). Let $f : \mathbb{R} \to \mathbb{R}$ be a continuous function with
infinite derivatives. Let $a \in \mathbb{R}$ be a fixed constant. The Taylor approximation of f at
x = a is
$$
f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n,
$$
where $f^{(n)}$ denotes the nth-order derivative of f.


Taylor approximation is a geometry-based approximation. It approximates the function 
according to the offset, slope, curvature, and so on. According to Definition 1.2, the Taylor 
series has an infinite number of terms. If we use a finite number of terms, we obtain the 
nth-order Taylor approximation: 


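$$
\begin{aligned}
\text{First-Order}: \quad f(x) &= \underbrace{f(a)}_{\text{offset}} + \underbrace{f'(a)(x - a)}_{\text{slope}} + O((x - a)^2) \\
\text{Second-Order}: \quad f(x) &= \underbrace{f(a)}_{\text{offset}} + \underbrace{f'(a)(x - a)}_{\text{slope}} + \underbrace{\frac{f''(a)}{2!}(x - a)^2}_{\text{curvature}} + O((x - a)^3).
\end{aligned}
$$

Here, the big-O notation $O(\varepsilon^k)$ means any term that has an order at least power k. For
small $\varepsilon$, i.e., $\varepsilon \ll 1$, a high-order term $O(\varepsilon^k) \approx 0$ for large k.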


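Example 1.1. Let $f(x) = \sin x$. Then the Taylor approximation at x = 0 is
$$
\begin{aligned}
f(x) &\approx f(0) + f'(0)(x - 0) + \frac{f''(0)}{2!}(x - 0)^2 + \frac{f'''(0)}{3!}(x - 0)^3 \\
&= \sin(0) + (\cos 0)(x - 0) - \frac{\sin(0)}{2!}(x - 0)^2 - \frac{\cos(0)}{3!}(x - 0)^3 \\
&= x - \frac{x^3}{6}.
\end{aligned}
$$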

We show the first few approximations in Figure 1.6. 

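One should be reminded that Taylor approximation approximates a function f(x)
at a particular point x = a. Therefore, the approximation of f near x = 0 and the
approximation of f near x = π/2 are different. For example, the Taylor approximation
at x = π/2 for $f(x) = \sin x$ is
$$
\begin{aligned}
f(x) &\approx f\left(\tfrac{\pi}{2}\right) + f'\left(\tfrac{\pi}{2}\right)\left(x - \tfrac{\pi}{2}\right) + \frac{f''\left(\tfrac{\pi}{2}\right)}{2!}\left(x - \tfrac{\pi}{2}\right)^2 + \frac{f'''\left(\tfrac{\pi}{2}\right)}{3!}\left(x - \tfrac{\pi}{2}\right)^3 \\
&= \sin\tfrac{\pi}{2} + \cos\tfrac{\pi}{2}\left(x - \tfrac{\pi}{2}\right) - \frac{\sin\tfrac{\pi}{2}}{2!}\left(x - \tfrac{\pi}{2}\right)^2 - \frac{\cos\tfrac{\pi}{2}}{3!}\left(x - \tfrac{\pi}{2}\right)^3 \\
&= 1 - \frac{1}{2}\left(x - \frac{\pi}{2}\right)^2.
\end{aligned}
$$

Figure 1.6: Taylor approximation of the function $f(x) = \sin x$. (a) Approximate at x = 0. (b) Approximate at x = π/2. Each panel shows sin x together with its 3rd-order, 5th-order, and 7th-order approximations.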
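
To get a feel for how quickly the approximation improves with the order, one can evaluate the Taylor polynomials of sin x about 0 directly. The sketch below does this at x = 0.5; the function name `taylor_sin` and the evaluation point are illustrative choices of ours.

```python
import math


def taylor_sin(x, order):
    """Taylor polynomial of sin(x) about 0, keeping terms up to x**order."""
    total = 0.0
    for n in range(1, order + 1, 2):           # only odd powers appear
        sign = (-1) ** ((n - 1) // 2)          # signs alternate: +, -, +, ...
        total += sign * x ** n / math.factorial(n)
    return total


x = 0.5
errors = {order: abs(taylor_sin(x, order) - math.sin(x)) for order in (1, 3, 5, 7)}
for order, err in errors.items():
    print(order, err)
```

Repeating the experiment at a point far from 0 (say x = 5) shows much larger errors for the same orders, which is why the expansion point matters.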


1.2.2 Exponential series 


An immediate application of the Taylor approximation is to derive the exponential series. 


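Theorem 1.4. Let x be any real number. Then,
$$
e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{x^k}{k!}.
$$

Proof. Let $f(x) = e^x$. Then the Taylor approximation at x = 0 gives
$$
\begin{aligned}
f(x) &= f(0) + f'(0)(x - 0) + \frac{f''(0)}{2!}(x - 0)^2 + \cdots \\
&= e^0 + e^0(x - 0) + \frac{e^0}{2!}(x - 0)^2 + \cdots \\
&= 1 + x + \frac{x^2}{2} + \cdots = \sum_{k=0}^{\infty} \frac{x^k}{k!}.
\end{aligned}
$$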


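Practice Exercise 1.5. Evaluate $\sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!}$.

Solution.
$$
\sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1.
$$
This result will be useful for Poisson random variables in Chapter 3.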
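
The exponential series converges fast enough that a modest number of terms already reproduces this sum numerically. The following sketch assumes λ = 3 and a truncation at 100 terms, both of which are our arbitrary choices; `poisson_total` is an illustrative name.

```python
import math


def poisson_total(lam, terms=100):
    """Truncated sum of lam^k * exp(-lam) / k! for k = 0, ..., terms - 1."""
    return sum(lam ** k * math.exp(-lam) / math.factorial(k) for k in range(terms))


total = poisson_total(3.0)
print(total)
```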
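If we substitute $x = j\theta$ where $j = \sqrt{-1}$, then we can show that
$$
\begin{aligned}
\underbrace{\cos\theta}_{\text{real}} + j\underbrace{\sin\theta}_{\text{imaginary}} = e^{j\theta}
&= 1 + j\theta + \frac{(j\theta)^2}{2!} + \frac{(j\theta)^3}{3!} + \cdots \\
&= \left(1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \cdots\right) + j\left(\theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \cdots\right).
\end{aligned}
$$
Matching the real and the imaginary terms, we can show that
$$
\begin{aligned}
\cos\theta &= 1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \cdots \\
\sin\theta &= \theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \cdots
\end{aligned}
$$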


This gives the infinite series representations of the two trigonometric functions. 


1.2.3 Logarithmic approximation


Taylor approximation also allows us to find approximations to logarithmic functions. We 
start by presenting a lemma. 


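Lemma 1.1. Let 0 < x < 1 be a constant. Then,
$$
\log(1 + x) = x - \frac{x^2}{2} + O(x^3).
$$

Proof. Let $f(x) = \log(1 + x)$. Then, the derivatives of f are
$$
f'(x) = \frac{1}{1 + x}, \quad \text{and} \quad f''(x) = -\frac{1}{(1 + x)^2}.
$$
Taylor approximation at x = 0 gives
$$
\begin{aligned}
f(x) &= f(0) + f'(0)(x - 0) + \frac{f''(0)}{2}(x - 0)^2 + O(x^3) \\
&= \log 1 + \left(\frac{1}{1 + 0}\right) x - \left(\frac{1}{(1 + 0)^2}\right) \frac{x^2}{2} + O(x^3) \\
&= x - \frac{x^2}{2} + O(x^3).
\end{aligned}
$$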


The difference between this result and the result we showed in the beginning of this 
section is the order of polynomials we used to approximate the logarithm: 


• First-order: $\log(1 + x) = x$
• Second-order: $\log(1 + x) = x - x^2/2$.


What order of approximation is good? It depends on where you want the approximation to 
be good, and how far you want the approximation to go. The difference between first-order 
and second-order approximations is shown in Figure 1.7. 
Figure 1.7: The function $f(x) = \log(1 + x)$, the first-order approximation $\hat{f}(x) = x$, and the
second-order approximation $\hat{f}(x) = x - x^2/2$.
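
The difference between the two orders is easy to tabulate. The sketch below compares both approximations of log(1 + x) at a few sample points that we picked for illustration.

```python
import math


def first_order(x):
    """First-order approximation of log(1 + x)."""
    return x


def second_order(x):
    """Second-order approximation of log(1 + x)."""
    return x - x ** 2 / 2


sample_points = (0.01, 0.1, 0.5)
first_errors = [abs(first_order(x) - math.log(1 + x)) for x in sample_points]
second_errors = [abs(second_order(x) - math.log(1 + x)) for x in sample_points]

for x, e1, e2 in zip(sample_points, first_errors, second_errors):
    print(x, e1, e2)
```

For these points the second-order error is smaller, and both errors grow as x moves away from 0.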


Example 1.2. When we prove the Central Limit Theorem in Chapter 6, we need to
use the following result.
$$
\lim_{N \to \infty} \left(1 + \frac{s^2}{2N}\right)^N = e^{s^2/2}.
$$
The proof of this equation can be done using the Taylor approximation. Consider
$N \log\left(1 + \frac{s^2}{2N}\right)$. By the logarithmic lemma, we can obtain the second-order
approximation:
$$
\log\left(1 + \frac{s^2}{2N}\right) \approx \frac{s^2}{2N} - \frac{s^4}{8N^2}.
$$
Therefore, multiplying both sides by N yields
$$
N \log\left(1 + \frac{s^2}{2N}\right) \approx \frac{s^2}{2} - \frac{s^4}{8N}.
$$
Putting the limit N → ∞ we can show that
$$
\lim_{N \to \infty} \left\{N \log\left(1 + \frac{s^2}{2N}\right)\right\} = \frac{s^2}{2}.
$$
Taking exponential on both sides yields
$$
\exp\left\{\lim_{N \to \infty} N \log\left(1 + \frac{s^2}{2N}\right)\right\} = \exp\left\{\frac{s^2}{2}\right\}.
$$
Moving the limit outside the exponential yields the result. Figure 1.8 provides a
pictorial illustration.

Figure 1.8: We plot a sequence of functions $f_N(x) = \left(1 + \frac{x^2}{2N}\right)^N$ and its limit $f(x) = e^{x^2/2}$.
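
The convergence in this example can be observed numerically by evaluating the finite-N expression for growing N. The sketch below fixes s = 1, an arbitrary value of ours; `finite_n` is an illustrative helper name.

```python
import math


def finite_n(s, N):
    """Evaluate (1 + s^2 / (2N))^N for a finite N."""
    return (1 + s ** 2 / (2 * N)) ** N


s = 1.0
limit = math.exp(s ** 2 / 2)
gaps = [abs(finite_n(s, N) - limit) for N in (10, 100, 1000, 10000)]
print(limit)
print(gaps)
```

Each tenfold increase in N shrinks the gap by roughly a factor of ten, consistent with a leading error term that decays like 1/N.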


1.3 Integration


When you learned calculus, your teacher probably told you that there are two ways to
compute an integral:

• Substitution:
$$
\int f(ax)\, dx = \frac{1}{a} \int f(u)\, du.
$$

• By parts:
$$
\int u\, dv = uv - \int v\, du.
$$

Besides these two, we want to teach you two more. The first technique uses even and odd
functions when integrating over an interval that is symmetric about the y-axis. If a function is even,
you just need to integrate over half of the interval and double the result. If a function is odd, you
will get a zero. The second technique is to leverage the fact that a probability density function
integrates to 1. We will discuss the first technique here and defer the second technique to Chapter 4.
Besides the two integration techniques, we will review the fundamental theorem of
calculus. We will need it when we study cumulative distribution functions in Chapter 4.

1.3.1 Odd and even functions 


Definition 1.3. A function $f : \mathbb{R} \to \mathbb{R}$ is even if for any $x \in \mathbb{R}$,
$$
f(x) = f(-x),
$$
and f is odd if
$$
f(x) = -f(-x).
$$

Essentially, an even function flips over about the y-axis, whereas an odd function flips over
both the x- and y-axes.


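Example 1.3. The function $f(x) = x^2 - 0.4x^4$ is even, because
$$
f(-x) = (-x)^2 - 0.4(-x)^4 = x^2 - 0.4x^4 = f(x).
$$
See Figure 1.9(a) for illustration. When integrating the function, we have
$$
\int_{-1}^{1} f(x)\, dx = 2\int_0^1 f(x)\, dx = 2\int_0^1 \left(x^2 - 0.4x^4\right) dx = 2\left[\frac{x^3}{3} - \frac{0.4}{5}x^5\right]_{x=0}^{x=1} = \frac{38}{75}.
$$

Example 1.4. The function $f(x) = x \exp\left\{-\frac{x^2}{2}\right\}$ is odd, because
$$
f(-x) = (-x) \exp\left\{-\frac{(-x)^2}{2}\right\} = -x \exp\left\{-\frac{x^2}{2}\right\} = -f(x).
$$
See Figure 1.9(b) for illustration. When integrating the function, we can let u = −x.
Then, the integral becomes
$$
\begin{aligned}
\int_{-1}^{1} f(x)\, dx &= \int_{-1}^{0} f(x)\, dx + \int_0^1 f(x)\, dx \\
&= \int_0^1 f(-u)\, du + \int_0^1 f(x)\, dx \\
&= -\int_0^1 f(u)\, du + \int_0^1 f(x)\, dx = 0.
\end{aligned}
$$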
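
Both symmetry shortcuts can be checked with a simple numerical integrator. The sketch below uses a midpoint rule written by hand; the helper `midpoint_integral`, the number of subintervals, and the reading of the odd example as f(x) = x exp(−x²/2) are our assumptions for illustration.

```python
import math


def midpoint_integral(f, a, b, n=20000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def even_f(x):
    return x ** 2 - 0.4 * x ** 4


def odd_f(x):
    return x * math.exp(-x ** 2 / 2)


full_even = midpoint_integral(even_f, -1, 1)   # integrate over the whole interval
half_even = midpoint_integral(even_f, 0, 1)    # integrate over half of it
full_odd = midpoint_integral(odd_f, -1, 1)     # should vanish by anti-symmetry

print(full_even, 2 * half_even, full_odd)
```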


Figure 1.9: (a) Even function. (b) Odd function. An even function is symmetric about the y-axis, and so
the integration $\int_{-a}^{a} f(x)\, dx = 2\int_0^a f(x)\, dx$. An odd function is anti-symmetric about the y-axis.
Thus, $\int_{-a}^{a} f(x)\, dx = 0$.


1.3.2 Fundamental Theorem of Calculus


Our following result is the Fundamental Theorem of Calculus. It is a handy tool that links 
integration and differentiation. 


Theorem 1.5 (Fundamental Theorem of Calculus). Let $f : [a, b] \to \mathbb{R}$ be a continuous
function defined on a closed interval [a, b]. Then, for any $x \in (a, b)$,
$$
f(x) = \frac{d}{dx} \int_a^x f(t)\, dt. \tag{1.12}
$$


Before we prove the result, let us understand the theorem if you have forgotten its meaning. 


Example 1.5. Consider a function $f(t) = t^2$. If we integrate the function from 0 to
x, we will obtain another function
$$
F(x) \overset{\text{def}}{=} \int_0^x f(t)\, dt = \int_0^x t^2\, dt = \frac{x^3}{3}.
$$
On the other hand, we can differentiate F(x) to obtain f(x):
$$
f(x) = \frac{d}{dx} F(x) = \frac{d}{dx} \frac{x^3}{3} = x^2.
$$
The fundamental theorem of calculus basically puts the two together:
$$
f(x) = \frac{d}{dx} \int_0^x f(t)\, dt.
$$


That’s it. Nothing more and nothing less. 


How can the fundamental theorem of calculus ever be useful when studying probability?
Very soon you will learn two concepts: probability density function and cumulative
distribution function. These two functions are related to each other by the fundamental
theorem of calculus. To give you a concrete example, we write down the probability density
function of an exponential random variable. (Please do not panic about the exponential
random variable. Just think of it as a “rapidly decaying” function.)
$$
f(x) = e^{-x}, \quad x \ge 0.
$$
It turns out that the cumulative distribution function is
$$
F(x) = \int_0^x f(t)\, dt = \int_0^x e^{-t}\, dt = 1 - e^{-x}.
$$
You can also check that $f(x) = \frac{d}{dx} F(x)$. The fundamental theorem of calculus says that if
you tell me $F(x) = \int_0^x e^{-t}\, dt$ (for whatever reason), I will be able to tell you that $f(x) = e^{-x}$
merely by visually inspecting the integrand without doing the differentiation.

Figure 1.10 illustrates the pair of functions $f(x) = e^{-x}$ and $F(x) = 1 - e^{-x}$. One thing
you should notice is that the height of F(x) is the area under the curve of f(t) from −∞ to x.
For example, in Figure 1.10 we show the area under the curve from 0 to 2. Correspondingly
in F(x), the height is F(2).

Figure 1.10: The pair of functions $f(x) = e^{-x}$ and $F(x) = 1 - e^{-x}$.
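
The relationship between the pair f and F can be verified with a finite difference. The following sketch differentiates F numerically at x = 2; the step size h and the helper name `numerical_derivative` are our own choices.

```python
import math


def F(x):
    """Integral of exp(-t) from 0 to x."""
    return 1 - math.exp(-x)


def f(x):
    """The integrand evaluated at x."""
    return math.exp(-x)


def numerical_derivative(func, x, h=1e-6):
    """Central-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


x = 2.0
slope = numerical_derivative(F, x)
print(slope, f(x))
```

The slope of F at x = 2 matches the value of the integrand at the same point, as the theorem states.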


The following proof of the Fundamental Theorem of Calculus can be skipped if it is your
first time reading the book.


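Proof. Our proof is based on Stewart (6th Edition), Section 5.3. Define the integral as a
function F:
$$
F(x) = \int_a^x f(t)\, dt.
$$
The derivative of F with respect to x is
$$
\begin{aligned}
\frac{d}{dx} F(x) &= \lim_{h \to 0} \frac{F(x + h) - F(x)}{h} \\
&= \lim_{h \to 0} \frac{1}{h} \left(\int_a^{x+h} f(t)\, dt - \int_a^x f(t)\, dt\right) \\
&= \lim_{h \to 0} \frac{1}{h} \int_x^{x+h} f(t)\, dt \\
&\overset{(a)}{\le} \lim_{h \to 0} \frac{1}{h} \int_x^{x+h} \left\{\max_{x \le \tau \le x+h} f(\tau)\right\} dt \\
&= \lim_{h \to 0} \left\{\max_{x \le \tau \le x+h} f(\tau)\right\}.
\end{aligned}
$$
Here, the inequality in (a) holds because
$$
f(t) \le \max_{x \le \tau \le x+h} f(\tau)
$$
for all $x \le t \le x + h$. The maximum exists because f is continuous in a closed interval.
Using the parallel argument, we can show that
$$
\begin{aligned}
\frac{d}{dx} F(x) &= \lim_{h \to 0} \frac{F(x + h) - F(x)}{h} \\
&= \lim_{h \to 0} \frac{1}{h} \left(\int_a^{x+h} f(t)\, dt - \int_a^x f(t)\, dt\right) \\
&= \lim_{h \to 0} \frac{1}{h} \int_x^{x+h} f(t)\, dt \\
&\ge \lim_{h \to 0} \frac{1}{h} \int_x^{x+h} \left\{\min_{x \le \tau \le x+h} f(\tau)\right\} dt \\
&= \lim_{h \to 0} \left\{\min_{x \le \tau \le x+h} f(\tau)\right\}.
\end{aligned}
$$
Combining the two results, we have that
$$
\lim_{h \to 0} \left\{\min_{x \le \tau \le x+h} f(\tau)\right\} \le \frac{d}{dx} F(x) \le \lim_{h \to 0} \left\{\max_{x \le \tau \le x+h} f(\tau)\right\}.
$$
However, since the two limits are both converging to f(x) as h → 0, we conclude that
$$
\frac{d}{dx} F(x) = f(x).
$$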


Remark. An alternative proof is to use Mean Value Theorem in terms of Riemann-Stieltjes 
integrals (see, e.g., Tom Apostol, Mathematical Analysis, 2nd edition, Theorem 7.34). To 
handle more general functions such as delta functions, one can use techniques in Lebesgue’s 
integration. However, this is beyond the scope of this book. 


This is the end of the proof. Please join us again. 


In many practical problems, the fundamental theorem of calculus needs to be used in 
conjunction with the chain rule. 


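Corollary 1.3. Let $f : [a, b] \to \mathbb{R}$ be a continuous function defined on a closed interval
[a, b]. Let $g : \mathbb{R} \to [a, b]$ be a continuously differentiable function. Then, for any
$x \in (a, b)$,
$$
\frac{d}{dx} \int_a^{g(x)} f(t)\, dt = g'(x) \cdot f(g(x)). \tag{1.13}
$$


Proof. We can prove this with the chain rule: Let y = g(x). Then we have
$$
\frac{d}{dx} \int_a^{g(x)} f(t)\, dt = \frac{dy}{dx} \cdot \frac{d}{dy} \int_a^{y} f(t)\, dt = g'(x) f(y),
$$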


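which completes the proof.

Practice Exercise 1.6. Evaluate the integral
$$
\frac{d}{dx} \int_0^{x - \mu} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{t^2}{2\sigma^2}\right\} dt.
$$

Solution. Let y = x − µ. Then by using the fundamental theorem of calculus, we can
show that
$$
\begin{aligned}
\frac{d}{dx} \int_0^{x - \mu} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{t^2}{2\sigma^2}\right\} dt
&= \frac{dy}{dx} \cdot \frac{d}{dy} \int_0^{y} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{t^2}{2\sigma^2}\right\} dt \\
&= \frac{d(x - \mu)}{dx} \cdot \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{y^2}{2\sigma^2}\right\} \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}.
\end{aligned}
$$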
This result will be useful when we do linear transformations of a Gaussian random 
variable in Chapter 4. 
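
The answer to this exercise can be confirmed numerically. The sketch below expresses the integral through the error function from Python's standard `math` module and then differentiates it by a central difference; the parameter values µ = 1, σ = 2, x = 2.5 are arbitrary choices of ours.

```python
import math


def gaussian_pdf(t, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-t ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


def integral_to(x, mu, sigma):
    """Integral of the zero-mean Gaussian density from 0 to x - mu.

    Written with the error function: 0.5 * erf((x - mu) / (sigma * sqrt(2))).
    """
    return 0.5 * math.erf((x - mu) / (sigma * math.sqrt(2)))


mu, sigma, x, h = 1.0, 2.0, 2.5, 1e-6
derivative = (integral_to(x + h, mu, sigma) - integral_to(x - h, mu, sigma)) / (2 * h)
expected = gaussian_pdf(x - mu, sigma)
print(derivative, expected)
```

Since the upper limit x − µ has derivative 1 with respect to x, the chain-rule factor does not change the value here; a limit such as 2x would introduce a factor of 2.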


1.4 Linear Algebra 


The two most important subjects for data science are probability, which is the subject of the 
book you are reading, and linear algebra, which concerns matrices and vectors. We cannot 
cover linear algebra in detail because this would require another book. However, we need to 
highlight some ideas that are important for doing data analysis. 


1.4.1 Why do we need linear algebra in data science? 


Consider a dataset of the crime rate of several cities as shown below, downloaded from 
https://web.stanford.edu/~hastie/StatLearnSparsity/data.html. 

The table shows that the crime rate depends on several factors such as funding for the 
police department, the percentage of high school graduates, etc. 


city | crime rate | funding | hs | no-hs | college | college4
   1 |        478 |      40 | 74 |    11 |      31 |       20
   2 |        494 |      32 | 72 |    11 |      43 |       18
   3 |        643 |      57 | 71 |    18 |      16 |       16
   4 |        341 |      31 | 71 |    11 |      25 |       19
   ⋮ |          ⋮ |       ⋮ |  ⋮ |     ⋮ |       ⋮ |        ⋮
  50 |        940 |      66 | 67 |    26 |      18 |       16

What questions can we ask about this table? We can ask: What is the most influential
cause of the crime rate? What are the leading contributions to the crime rate? To answer 
these questions, we need to describe these numbers. One way to do it is to put the numbers 
in matrices and vectors. For example, 


478 40 74 

494 32 72 
Yerime = : > fund = : » hs = : pees 

940 66 67 


With this vector expression of the data, the analysis questions can roughly be translated 
to finding (6’s in the following equation: 


Yerime — PrundL tuna a BnsLhs i Beotlege£ colleges: 


This equation offers a lot of useful insights. First, it is a linear model of $\mathbf{y}_{\text{crime}}$. We call
it a linear model because the observable $\mathbf{y}_{\text{crime}}$ is written as a linear combination of the
variables $\mathbf{x}_{\text{fund}}$, $\mathbf{x}_{\text{hs}}$, etc. The linear model assumes that the variables are scaled and added
to generate the observed phenomena. This assumption is not always realistic, but it is often
a fair assumption that greatly simplifies the problem. For example, if we can show that all
$\beta$'s are zero except $\beta_{\text{fund}}$, then we can conclude that the crime rate is solely dependent on
the police funding. If two variables are correlated, e.g., high school graduate and college
graduate, we would expect the $\beta$'s to change simultaneously.
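The linear model can further be simplified to a matrix-vector equation:

$$\begin{bmatrix} | \\ \mathbf{y}_{\text{crime}} \\ | \end{bmatrix}
= \begin{bmatrix} | & | & & | \\ \mathbf{x}_{\text{fund}} & \mathbf{x}_{\text{hs}} & \cdots & \mathbf{x}_{\text{college4}} \\ | & | & & | \end{bmatrix}
\begin{bmatrix} \beta_{\text{fund}} \\ \beta_{\text{hs}} \\ \vdots \\ \beta_{\text{college4}} \end{bmatrix}.$$

Here, the lines emphasize that the vectors are column vectors. If we denote the matrix
in the middle as $\mathbf{A}$ and the vector as $\boldsymbol{\beta}$, then the equation is equivalent to $\mathbf{y} = \mathbf{A}\boldsymbol{\beta}$. So we
can find $\boldsymbol{\beta}$ by appropriately inverting the matrix $\mathbf{A}$. If two columns of $\mathbf{A}$ are dependent, we
will not be able to resolve the corresponding $\beta$'s uniquely.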
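
To make the matrix-vector view concrete, here is a minimal sketch that uses only the five rows printed in the table (cities 1, 2, 3, 4, and 50) and only two of the variables, funding and hs. The choice of rows and columns, the use of a least-squares solver to "invert" the tall matrix, and the artificial dependent column are all assumptions made for illustration; this is not the analysis carried out in Chapter 7.

```python
import numpy as np

# Rows for cities 1, 2, 3, 4, and 50, copied from the table above
# (the funding entry for city 4 is read as 31).
y_crime = np.array([478.0, 494.0, 643.0, 341.0, 940.0])
x_fund = np.array([40.0, 32.0, 57.0, 31.0, 66.0])
x_hs = np.array([74.0, 72.0, 71.0, 71.0, 67.0])

# A has one column per variable, so that y is modeled as A @ beta.
A = np.column_stack([x_fund, x_hs])
beta, _, rank, _ = np.linalg.lstsq(A, y_crime, rcond=None)
print("beta =", beta, "rank =", rank)

# Append a column that is an exact multiple of x_fund (a dependent column).
A_dep = np.column_stack([x_fund, x_hs, 2.0 * x_fund])
rank_dep = np.linalg.matrix_rank(A_dep)
print("columns:", A_dep.shape[1], "rank:", rank_dep)
```

With the dependent column added, the matrix has three columns but rank two, so the coefficients of the two funding columns can be traded against each other without changing the prediction. That is the non-uniqueness described in the paragraph above.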

As you can see from the above data analysis problem, matrices and vectors offer a way 
to describe the data. We will discuss the calculations in Chapter 7. However, to understand 
how to interpret the results from the matrix-vector equations, we need to review some basic 
ideas about matrices and vectors. 




1.4.2 Everything you need to know about linear algebra 


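Throughout this book, you will see different sets of notations. For linear algebra, we also
have a set of notations. We denote $\mathbf{x} \in \mathbb{R}^d$ a $d$-dimensional vector taking real numbers as its
entries. An $M$-by-$N$ matrix is denoted as $\mathbf{X} \in \mathbb{R}^{M \times N}$. The transpose of a matrix is denoted
as $\mathbf{X}^T$. A matrix $\mathbf{X}$ can be viewed according to its columns and its rows:

$$\mathbf{X} = \begin{bmatrix} | & | & & | \\ \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_N \\ | & | & & | \end{bmatrix}
= \begin{bmatrix} \text{---} & \mathbf{x}^1 & \text{---} \\ \text{---} & \mathbf{x}^2 & \text{---} \\ & \vdots & \\ \text{---} & \mathbf{x}^M & \text{---} \end{bmatrix}.$$

Here, $\mathbf{x}_j$ denotes the $j$th column of $\mathbf{X}$, and $\mathbf{x}^i$ denotes the $i$th row of $\mathbf{X}$. The $(i, j)$th element
of $\mathbf{X}$ is denoted as $x_{ij}$ or $[\mathbf{X}]_{ij}$. The identity matrix is denoted as $\mathbf{I}$. The $i$th column of $\mathbf{I}$
is denoted as $\mathbf{e}_i = [0, \ldots, 1, \ldots, 0]^T$, and is called the $i$th standard basis vector. An all-zero
vector is denoted as $\mathbf{0} = [0, \ldots, 0]^T$.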

What is the most important thing to know about linear algebra? From a data analysis 
point of view, Figure 1.11 gives us the answer. The picture is straightforward, but it captures 
all the essence. In almost all the data analysis problems, ultimately, there are three things we 
care about: (i) the observable vector $\mathbf{y}$, (ii) the variable vectors $\mathbf{x}_n$, and (iii) the coefficients
$\beta_n$. The set of variable vectors $\{\mathbf{x}_n\}_{n=1}^N$ spans a vector space in which all vectors are living.
Some of these variable vectors are correlated, and some are not. However, for the sake of
this discussion, let us assume they are independent of each other. Then for any observable
vector $\mathbf{y}$, we can always project $\mathbf{y}$ in the directions determined by $\{\mathbf{x}_n\}_{n=1}^N$. The projection
of $\mathbf{y}$ onto $\mathbf{x}_n$ is the coefficient $\beta_n$. A larger value of $\beta_n$ means that the variable $\mathbf{x}_n$ has more
contributions.


Figure 1.11: Representing an observable vector $\mathbf{y}$ by a linear combination of variable vectors $\mathbf{x}_1$, $\mathbf{x}_2$,
and $\mathbf{x}_3$. The combination weights are $\beta_1$, $\beta_2$, $\beta_3$.


Why is this picture so important? Because most of the data analysis problems can be 
expressed, or approximately expressed, by the picture: 


$$\mathbf{y} = \sum_{n=1}^{N} \beta_n \mathbf{x}_n.$$


If you recall the crime rate example, this equation is precisely the linear model we used to 
describe the crime rate. This equation can also describe many other problems. 


Example 1.6. Polynomial fitting. Consider a dataset of pairs of numbers $(t_m, y_m)$ for
$m = 1, \ldots, M$, as shown in Figure 1.12. After a visual inspection of the dataset, we
propose to use a line to fit the data. A line is specified by the equation

$$y_m = a t_m + b, \qquad m = 1, \ldots, M,$$

where $a \in \mathbb{R}$ is the slope and $b \in \mathbb{R}$ is the y-intercept. The goal of this problem is to
find one line (which is fully characterized by $(a, b)$) such that it has the best fit to all
the data pairs $(t_m, y_m)$ for $m = 1, \ldots, M$. This problem can be described in matrices
and vectors by noting that

$$\begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = a \begin{bmatrix} t_1 \\ \vdots \\ t_M \end{bmatrix} + b \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix},$$

or more compactly,

$$\mathbf{y} = \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2.$$

Here, $\mathbf{x}_1 = [t_1, \ldots, t_M]^T$ contains all the variable values, and $\mathbf{x}_2 = [1, \ldots, 1]^T$ contains
a constant offset. The coefficients are $\beta_1 = a$ and $\beta_2 = b$.


t_m      y_m
0.1622   2.1227
0.7943   3.3354
0.7379   3.4054
0.2691   2.5672
0.4228   2.3796
0.6020   3.2942

[Plot: the data points together with a candidate line and the best-fit line.]

Figure 1.12: Example of fitting a set of data points. The problem can be described by $\mathbf{y} =
\beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2$.
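
The line-fitting example can be tried directly on the six data pairs listed with Figure 1.12. The sketch below assumes that "best fit" is meant in the least-squares sense (the criterion this section develops under matrix calculus) and that these six pairs are the whole dataset, which the figure does not confirm.

```python
import numpy as np

# The six (t_m, y_m) pairs listed alongside Figure 1.12.
t = np.array([0.1622, 0.7943, 0.7379, 0.2691, 0.4228, 0.6020])
y = np.array([2.1227, 3.3354, 3.4054, 2.5672, 2.3796, 3.2942])

x1 = t                  # variable values
x2 = np.ones_like(t)    # constant offset
X = np.column_stack([x1, x2])

# Least-squares estimate of (beta_1, beta_2) = (a, b).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b = beta
print("slope a =", a, "intercept b =", b)
```

Stacking $\mathbf{x}_1$ and $\mathbf{x}_2$ as the columns of one matrix is what turns the line-fitting problem into the same form as the crime rate model.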


Example 1.7. Image compression. The JPEG compression for images is based on
the concept of discrete cosine transform (DCT). The DCT consists of a set of basis
vectors, or $\{\mathbf{x}_n\}_{n=1}^N$ using our notation. In the most standard setting, each basis vector
$\mathbf{x}_n$ consists of 8 x 8 pixels, and there are $N = 64$ of these $\mathbf{x}_n$'s. Given an image, we can
partition the image into $M$ small blocks of 8 x 8 pixels. Let us call one of these blocks
$\mathbf{y}$. Then, DCT represents the observation $\mathbf{y}$ as a linear combination of the DCT basis
vectors:

$$\mathbf{y} = \sum_{n=1}^{N} \beta_n \mathbf{x}_n.$$

The coefficients $\{\beta_n\}_{n=1}^N$ are called the DCT coefficients. They provide a representation
of $\mathbf{y}$, because once we know $\{\beta_n\}_{n=1}^N$, we can completely describe $\mathbf{y}$ because the
basis vectors $\{\mathbf{x}_n\}_{n=1}^N$ are known and fixed. The situation is depicted in Figure 1.13.

How can we compress images using DCT? In the 1970s, scientists found that most
images have strong leading DCT coefficients but weak tail DCT coefficients. In other
words, among the $N = 64$ $\beta_n$'s, only the first few are important. If we truncate the
number of DCT coefficients, we can effectively compress the number of bits required
to represent the image.
[Figure: an image block $\mathbf{y}$ written as a weighted sum of DCT basis images, $\mathbf{y} = \sum_{n=1}^{N} \beta_n \mathbf{x}_n$.]

Figure 1.13: JPEG image compression is based on the concept of discrete cosine transform, which
can be formulated as a matrix-vector problem.
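
The truncation idea in Example 1.7 can be sketched in a few lines. Everything specific below is an assumption made for illustration: the basis is the orthonormal 8 x 8 DCT-II built by hand, the block is a synthetic brightness ramp rather than real image data, and "the first few" coefficients are taken to be those whose horizontal and vertical frequency indices sum to at most 3. The real JPEG standard adds quantization and entropy coding, which are not modeled here.

```python
import numpy as np

B = 8  # block side length, so N = B * B = 64 basis vectors

# 1D orthonormal DCT-II matrix: row u is the cosine of frequency u.
u = np.arange(B).reshape(-1, 1)
i = np.arange(B).reshape(1, -1)
C = np.sqrt(2.0 / B) * np.cos(np.pi * (2 * i + 1) * u / (2 * B))
C[0, :] = np.sqrt(1.0 / B)

# 2D basis vectors x_n: each column of X is one vectorized 8x8 basis image.
X = np.kron(C, C).T          # shape (64, 64), with X.T @ X = I

# A hypothetical smooth 8x8 block (a brightness ramp), vectorized into y.
rows, cols = np.meshgrid(np.arange(B), np.arange(B), indexing="ij")
y = (100.0 + 10.0 * rows + 5.0 * cols).reshape(-1)

# Because the basis is orthonormal, the coefficients are inner products.
beta = X.T @ y

# Keep only the low-frequency coefficients (u + v <= 3) and zero the rest.
freq_u, freq_v = np.divmod(np.arange(B * B), B)
keep = (freq_u + freq_v) <= 3
beta_trunc = np.where(keep, beta, 0.0)

y_hat = X @ beta_trunc
rel_err = np.linalg.norm(y - y_hat) / np.linalg.norm(y)
print("coefficients kept:", int(keep.sum()), "of", B * B)
print("relative reconstruction error:", rel_err)
```

For a smooth block like this one, a small fraction of the 64 coefficients reproduces the block with a small relative error, which is the observation that makes truncation a useful compression step.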


We hope by now you are convinced of the importance of matrices and vectors in the 
context of data science. They are not “yet another” subject but an essential tool you must 
know how to use. So, what are the technical materials you must master? Here we go. 


1.4.3 Inner products and norms


We assume that you know the basic operations such as matrix-vector multiplication, taking 
the transpose, etc. If you have forgotten these, please consult any undergraduate linear 
algebra textbook such as Gilbert Strang’s Linear Algebra and its Applications. We will 
highlight a few of the most important operations for our purposes. 


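Definition 1.4 (Inner product). Let $\mathbf{x} = [x_1, \ldots, x_N]^T$, and $\mathbf{y} = [y_1, \ldots, y_N]^T$. The
inner product $\mathbf{x}^T \mathbf{y}$ is

$$\mathbf{x}^T \mathbf{y} = \sum_{i=1}^{N} x_i y_i. \tag{1.14}$$


Practice Exercise 1.7. Let $\mathbf{x} = [1, 0, -1]^T$, and $\mathbf{y} = [3, 2, 0]^T$. Find $\mathbf{x}^T \mathbf{y}$.
Solution. The inner product is $\mathbf{x}^T \mathbf{y} = (1)(3) + (0)(2) + (-1)(0) = 3$.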


Inner products are important because they tell us how two vectors are correlated. 
Figure 1.14 depicts the geometric meaning of an inner product. If two vectors are correlated 
(i.e., nearly parallel), then the inner product will give us a large value. Conversely, if the
two vectors are close to perpendicular, then the inner product will be small. Therefore, the 
inner product provides a measure of the closeness/similarity between two vectors. 


Figure 1.14: Geometric interpretation of inner product: We project one vector onto the other vector.
The projected distance is the inner product.
Creating vectors and computing the inner products are straightforward in MATLAB.
We simply need to define the column vectors x and y by using the command [] with ; to
denote the next row. The inner product is done using the transpose operation x' and vector
multiplication *.


% MATLAB code to perform an inner product
x = [1; 0; -1];
y = [3; 2; 0];
z = x'*y;


In Python, constructing a vector is done using the command np.array. Inside this
command, one needs to enter the array. For a column vector, we write [[1],[2],[3]], with
an outer [], and three inner [] for each entry. If the vector is a row vector, then one can omit
the inner []'s by just calling np.array([1, 2, 3]). Given two column vectors x and y,
the inner product is computed via np.dot(x.T,y), where np.dot is the command for inner
product, and x.T returns the transpose of x. One can also call np.transpose(x), which is
the same as x.T.


# Python code to perform an inner product
import numpy as np
x = np.array([[1],[0],[-1]])
y = np.array([[3],[2],[0]])
z = np.dot(np.transpose(x), y)
print(z)


In data analytics, the inner product of two vectors can be useful. Consider the vectors
in Table 1.1. Just from looking at the numbers, you probably will not see anything wrong.
However, let's compute the inner products. It turns out that $\mathbf{x}_1^T \mathbf{x}_2 = -0.0031$, whereas
$\mathbf{x}_1^T \mathbf{x}_3 = 2.0020$. There is almost no correlation between $\mathbf{x}_1$ and $\mathbf{x}_2$, but there is a substantial
correlation between $\mathbf{x}_1$ and $\mathbf{x}_3$. What happened? The vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ are random
vectors constructed independently and uncorrelated to each other. The last vector $\mathbf{x}_3$ was
constructed by $\mathbf{x}_3 = 2\mathbf{x}_1 - \pi/1000$. Since $\mathbf{x}_3$ is completely constructed from $\mathbf{x}_1$, they have
to be correlated.

    x_1        x_2        x_3
  0.0006    -0.0011    -0.0020
 -0.0014    -0.0024    -0.0059
 -0.0034     0.0073    -0.0099
  0.0001    -0.0066    -0.0030
  0.0074     0.0046     0.0116
  0.0007    -0.0061    -0.0017

Table 1.1: Three example vectors.


One caveat for this example is that the naive inner product $\mathbf{x}_i^T \mathbf{x}_j$ is scale-dependent.
For example, the vectors $\mathbf{x}_3 = \mathbf{x}_1$ and $\mathbf{x}_3 = 1000\mathbf{x}_1$ have the same amount of correlation,
but the simple inner product will give a larger value for the latter case. To solve this problem,
we first define the norm of the vectors:


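Definition 1.5 (Norm). Let $\mathbf{x} = [x_1, \ldots, x_N]^T$ be a vector. The $\ell_p$-norm of $\mathbf{x}$ is

$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{N} |x_i|^p \right)^{1/p}, \tag{1.15}$$

for any $p \geq 1$.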


The norm essentially tells us the length of the vector. This is most obvious if we consider
the $\ell_2$-norm:

$$\|\mathbf{x}\|_2 = \left( \sum_{i=1}^{N} x_i^2 \right)^{1/2}.$$

By taking the square on both sides, one can show that $\|\mathbf{x}\|_2^2 = \mathbf{x}^T \mathbf{x}$. This is called the
squared $\ell_2$-norm, and is the sum of the squares.

In MATLAB, computing the norm is done using the command norm. Here, we can
indicate the types of norms, e.g., norm(x,1) returns the $\ell_1$-norm whereas norm(x,2) returns
the $\ell_2$-norm (which is also the default).


% MATLAB code to compute the norm
x = [1; 0; -1];
x_norm = norm(x);


In Python, the norm command is listed in np.linalg. To call the $\ell_1$-norm, we use
np.linalg.norm(x,1), and by default the $\ell_2$-norm is np.linalg.norm(x).


# Python code to compute the norm
import numpy as np
x = np.array([[1],[0],[-1]])
x_norm = np.linalg.norm(x)


Using the norm, one can define an angle called the cosine angle between two vectors. 


Definition 1.6. The cosine angle between two vectors $\mathbf{x}$ and $\mathbf{y}$ is

$$\cos\theta = \frac{\mathbf{x}^T \mathbf{y}}{\|\mathbf{x}\|_2 \|\mathbf{y}\|_2}.$$


The difference between the cosine angle and the basic inner product is the normalization
in the denominator, which is the product $\|\mathbf{x}\|_2 \|\mathbf{y}\|_2$. This normalization factor scales
the vector $\mathbf{x}$ to $\mathbf{x}/\|\mathbf{x}\|_2$ and $\mathbf{y}$ to $\mathbf{y}/\|\mathbf{y}\|_2$. The scaling makes the length of the new vector
equal to unity, but it does not change the vector's orientation. Therefore, the cosine angle
is not affected by a very long vector or a very short vector. Only the angle matters. See
Figure 1.15.
Figure 1.15: The cosine angle is the inner product divided by the norms of the vectors.


Going back to the previous example, after normalization we can show that the cosine
angle between $\mathbf{x}_1$ and $\mathbf{x}_2$ is $\cos\theta_{1,2} = -0.0031$, whereas the cosine angle between $\mathbf{x}_1$ and
$\mathbf{x}_3$ is $\cos\theta_{1,3} = 0.8958$. There is still a strong correlation between $\mathbf{x}_1$ and $\mathbf{x}_3$, but now using
the cosine angle the value is between $-1$ and $+1$.
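
The experiment behind Table 1.1 can be imitated as follows. This is a sketch under assumptions: the text does not state the vector length or how the random entries were drawn, so the length N = 100000, the Gaussian entries scaled by 1/sqrt(N), and the seed are all hypothetical choices. Only the construction of the third vector, $\mathbf{x}_3 = 2\mathbf{x}_1 - \pi/1000$, comes from the text, so the printed numbers will not match the table exactly.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
N = 100_000                      # assumed length; not stated in the text

# x1 and x2 are drawn independently; the 1/sqrt(N) scale is an assumption.
x1 = rng.standard_normal(N) / np.sqrt(N)
x2 = rng.standard_normal(N) / np.sqrt(N)
x3 = 2 * x1 - np.pi / 1000       # construction given in the text

def cosine_angle(a, b):
    # Inner product divided by the product of the l2-norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("x1'x2 =", np.dot(x1, x2), " cos =", cosine_angle(x1, x2))
print("x1'x3 =", np.dot(x1, x3), " cos =", cosine_angle(x1, x3))

# Rescaling x3 changes the raw inner product but not the cosine angle.
print("x1'(1000*x3) =", np.dot(x1, 1000 * x3), " cos =", cosine_angle(x1, 1000 * x3))
```

The last line illustrates the scale-dependence caveat: multiplying a vector by 1000 multiplies the inner product by 1000, while the cosine angle stays the same.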


Remark 1: There are other norms one can use. The $\ell_1$-norm is useful for sparse models
where we want to have the fewest possible non-zeros. The $\ell_1$-norm of $\mathbf{x}$ is

$$\|\mathbf{x}\|_1 = \sum_{i=1}^{N} |x_i|,$$

which is the sum of absolute values. The $\ell_\infty$-norm picks the maximum of $\{|x_1|, \ldots, |x_N|\}$:

$$\|\mathbf{x}\|_\infty = \lim_{p \to \infty} \left( \sum_{i=1}^{N} |x_i|^p \right)^{1/p} = \max\{|x_1|, \ldots, |x_N|\},$$

because as $p \to \infty$, only the largest element will be amplified.
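
The three norms can be compared on a small vector. The sketch below uses an arbitrary example vector and evaluates the $\ell_p$ formula for increasing $p$ to show the limit numerically; the particular values of $p$ are chosen only for illustration.

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])    # arbitrary example vector

l1 = np.linalg.norm(x, 1)         # sum of absolute values
l2 = np.linalg.norm(x, 2)         # Euclidean length
linf = np.linalg.norm(x, np.inf)  # largest absolute value

# The l_p-norm for growing p approaches the l_infinity-norm.
for p in [1, 2, 4, 16, 64]:
    lp = np.sum(np.abs(x) ** p) ** (1.0 / p)
    print(p, lp)
print(l1, l2, linf)
```

As $p$ grows, the printed $\ell_p$ values decrease toward the largest absolute entry of the vector.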
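Remark 2: The standard $\ell_2$-norm is a circle, in the sense that the points with a constant
norm form a circle: Just consider $\mathbf{x} = [x_1, x_2]^T$. The norm
is $\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2}$. We can convert the circle to ellipses by considering a weighted norm.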


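Definition 1.7 (Weighted $\ell_2$-norm square). Let $\mathbf{x} = [x_1, \ldots, x_N]^T$ and let $\mathbf{W} =
\text{diag}(w_1, \ldots, w_N)$ be a non-negative diagonal matrix. The weighted $\ell_2$-norm square of
$\mathbf{x}$ is

$$\|\mathbf{x}\|_{\mathbf{W}}^2 = \mathbf{x}^T \mathbf{W} \mathbf{x} = \sum_{i=1}^{N} w_i x_i^2. \tag{1.17}$$


The geometry of the weighted $\ell_2$-norm is determined by the matrix $\mathbf{W}$. For example,
if $\mathbf{W} = \mathbf{I}$ (the identity operator), then $\|\mathbf{x}\|_{\mathbf{W}}^2 = \|\mathbf{x}\|_2^2$, which defines a circle. If $\mathbf{W}$ is any
“non-negative” matrix², then $\|\mathbf{x}\|_{\mathbf{W}}^2$ defines an ellipse.


²The technical term for these matrices is positive semi-definite matrices.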
In MATLAB, the weighted norm square is just a sequence of two matrix-vector
multiplications. This can be done using the command x'*W*x as shown below.


% MATLAB code to compute the weighted norm
W = [1 2 3; 4 5 6; 7 8 9];
x = [2; -1; 1];
z = x'*W*x


In Python, constructing the matrix W and the column vector x is done using np.array.
The matrix-vector multiplication is done using two np.dot commands: one for np.dot(W,x)
and the other one for np.dot(x.T, np.dot(W,x)).


# Python code to compute the weighted norm
import numpy as np
W = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
x = np.array([[2],[-1],[1]])
z = np.dot(x.T, np.dot(W,x))
print(z)


1.4.4 Matrix calculus


The last linear algebra topic we need to review is matrix calculus. As its name indicates, 
matrix calculus is about the differentiation of matrices and vectors. Why do we need differ- 
entiation for matrices and vectors? Because we want to find the minimum or maximum of 
a scalar function with a vector input. 

Let us go back to the crime rate problem we discussed earlier. Given the data, we
want to find the model coefficients $\beta_1, \ldots, \beta_N$ such that the variables can best explain the
observation. In other words, we want to minimize the deviation between $\mathbf{y}$ and the prediction
offered by our model:

$$\underset{\beta_1, \ldots, \beta_N}{\text{minimize}} \quad \left\| \mathbf{y} - \sum_{n=1}^{N} \beta_n \mathbf{x}_n \right\|^2.$$

This equation is self-explanatory. The norm $\|\cdot\|^2$ measures the deviation. If $\mathbf{y}$ can
be perfectly explained by $\{\mathbf{x}_n\}_{n=1}^N$, then the norm can eventually go to zero by finding a
good set of $\{\beta_1, \ldots, \beta_N\}$. The symbol $\underset{\beta_1, \ldots, \beta_N}{\text{minimize}}$ means to minimize the function by finding
$\{\beta_1, \ldots, \beta_N\}$. Note that the norm is taking a vector as the input and generating a scalar as
the output. It can be expressed as

$$\varepsilon(\boldsymbol{\beta}) = \left\| \mathbf{y} - \sum_{n=1}^{N} \beta_n \mathbf{x}_n \right\|^2$$

to emphasize this relationship. Here we define $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_N]^T$ as the collection of all
coefficients.

Given this setup, how would you determine $\boldsymbol{\beta}$ such that the deviation is minimized?
Our calculus teachers told us that we could take the function's derivative and set it to zero
for scalar problems. It is the same story for vectors. What we do is to take the derivative of
the error and set it equal to zero:

$$\frac{d}{d\boldsymbol{\beta}} \varepsilon(\boldsymbol{\beta}) = 0.$$

Now the question arises, how do we take the derivatives of $\varepsilon(\boldsymbol{\beta})$ when it takes a vector as
input? If we can answer this question, we will find the best $\boldsymbol{\beta}$. The answer is straightforward.
Since the function has one output and many inputs, take the derivative for each element
independently. This is called the scalar differentiation of vectors.


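Definition 1.8 (Scalar differentiation of vectors). Let $f : \mathbb{R}^N \to \mathbb{R}$ be a differentiable
scalar function, and let $y = f(\mathbf{x})$ for some input $\mathbf{x} \in \mathbb{R}^N$. Then,

$$\frac{dy}{d\mathbf{x}} = \begin{bmatrix} dy/dx_1 \\ \vdots \\ dy/dx_N \end{bmatrix}.$$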

As you can see from this definition, there is nothing conceptually challenging here. The only 
difficulty is that things can get tedious because there will be many terms. However, the good 
news is that mathematicians have already compiled a list of identities for common matrix 
differentiation. So instead of deriving every equation from scratch, we can enjoy the fruit of 
their hard work by referring to those formulae. The best place to find these equations is the 
Matrix Cookbook by Petersen and Pedersen.³ Here, we will mention two of the most useful
results. 


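Example 1.8. Let $y = \mathbf{x}^T \mathbf{A} \mathbf{x}$ for any matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$. Find $\frac{dy}{d\mathbf{x}}$.

Solution.

$$\frac{d}{d\mathbf{x}} \left( \mathbf{x}^T \mathbf{A} \mathbf{x} \right) = \mathbf{A}\mathbf{x} + \mathbf{A}^T \mathbf{x}.$$

Now, if $\mathbf{A}$ is symmetric, i.e., $\mathbf{A} = \mathbf{A}^T$, then

$$\frac{d}{d\mathbf{x}} \left( \mathbf{x}^T \mathbf{A} \mathbf{x} \right) = 2\mathbf{A}\mathbf{x}.$$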


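Example 1.9. Let $\varepsilon = \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2^2$, where $\mathbf{A} \in \mathbb{R}^{N \times N}$ is symmetric. Find $\frac{d\varepsilon}{d\mathbf{x}}$.

Solution. First, we note that

$$\varepsilon = \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2^2 = \mathbf{x}^T \mathbf{A}^T \mathbf{A} \mathbf{x} - 2\mathbf{y}^T \mathbf{A} \mathbf{x} + \mathbf{y}^T \mathbf{y}.$$

Taking the derivative with respect to $\mathbf{x}$ yields

$$\frac{d\varepsilon}{d\mathbf{x}} = 2\mathbf{A}^T \mathbf{A} \mathbf{x} - 2\mathbf{A}^T \mathbf{y} = 2\mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{y}).$$


³https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf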
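
Both differentiation identities can be checked numerically by differentiating one element at a time, exactly as Definition 1.8 prescribes. The sketch below is an illustration under assumptions: the helper numerical_gradient is a hypothetical name, the matrix and vectors are random, and the matrix is deliberately not symmetric, since neither formula as written requires symmetry.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of [df/dx_1, ..., df/dx_N]."""
    grad = np.zeros_like(x)
    for n in range(x.size):
        step = np.zeros_like(x)
        step[n] = h
        grad[n] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
N = 4
A = rng.standard_normal((N, N))   # a generic (non-symmetric) matrix
x = rng.standard_normal(N)
y = rng.standard_normal(N)

# Example 1.8: d/dx (x^T A x) = A x + A^T x
g1 = numerical_gradient(lambda v: v @ A @ v, x)
print(np.allclose(g1, A @ x + A.T @ x, atol=1e-5))

# Example 1.9: d/dx ||A x - y||^2 = 2 A^T (A x - y)
g2 = numerical_gradient(lambda v: np.sum((A @ v - y) ** 2), x)
print(np.allclose(g2, 2 * A.T @ (A @ x - y), atol=1e-5))
```

A check of this kind is a quick way to catch a transposed or missing factor when applying an identity from a reference table.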


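Going back to the crime rate problem, we can now show that

$$0 = \frac{d}{d\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = 2\mathbf{X}^T (\mathbf{X}\boldsymbol{\beta} - \mathbf{y}).$$

Therefore, the solution is

$$\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.$$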

As you can see, if we do not have access to the matrix calculus, we will not be able to solve the 
minimization problem. (There are alternative paths that do not require matrix calculus, but 
they require an understanding of linear subspaces and properties of the projection operators. 
So in some sense, matrix calculus is the easiest way to solve the problem.) When we discuss 
the linear regression methods in Chapter 7, we will cover the interpretation of the inverses 
and related topics. 

Matrix inversion is done using the command inv in MATLAB and np.linalg.inv in
Python. Below is an example in Python.


# Python code to compute a matrix inverse
import numpy as np
X = np.array([[1, 3], [-2, 7], [0, 1]])
XtX = np.dot(X.T, X)
XtXinv = np.linalg.inv(XtX)
print(XtXinv)


Sometimes, instead of computing the matrix inverse we are more interested in solving a
linear equation $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ (the least-squares solution of which is $\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$). In both MATLAB and
Python, there are built-in commands to do this. In MATLAB, the command is \ (backslash).


% MATLAB code to solve X beta = y
X = [1 3; -2 7; 0 1];
y = [2; 1; 0];
beta = X\y;


In Python, the built-in command is np.linalg.lstsq.


# Python code to solve X beta = y
import numpy as np
X = np.array([[1, 3], [-2, 7], [0, 1]])
y = np.array([[2], [1], [0]])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta)
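
The closed-form solution and the built-in solver can be compared on the same small example. This is a minimal sketch: it reuses the matrix and right-hand side from the code above (written as floating-point arrays, with y as a flat vector for convenience) and solves the normal equations with np.linalg.solve rather than forming the inverse explicitly, which is a common numerical preference and not something the text prescribes.

```python
import numpy as np

X = np.array([[1.0, 3.0], [-2.0, 7.0], [0.0, 1.0]])
y = np.array([2.0, 1.0, 0.0])

# Normal equations: solve (X^T X) beta = X^T y without forming the inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Built-in least-squares solver, as in the code above.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# At the minimizer the gradient 2 X^T (X beta - y) should vanish.
gradient = 2 * X.T @ (X @ beta_normal - y)
print(beta_normal, beta_lstsq, gradient)
```

The two estimates should coincide, and the gradient should be zero up to rounding error, which ties the code back to the condition $2\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y}) = 0$.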
Closing remark: In this section, we have given a brief introduction to a few of the most
relevant concepts in linear algebra. We will introduce further concepts in linear algebra in 
later chapters, such as eigenvalues, principal component analysis, linear transformations, 
and regularization, as they become useful for our discussion. 


1.5 Basic Combinatorics 


The last topic we review in this chapter is combinatorics. Combinatorics concerns the 
number of configurations that can be obtained from certain discrete experiments. It is useful 
because it provides a systematic way of enumerating cases. Combinatorics often becomes 
very challenging as the complexity of the event grows. However, you may rest assured that 
in this book, we will not tackle the more difficult problems of combinatorics; we will confine 
our discussion to two of the most basic principles: permutation and combination. 


1.5.1 Birthday paradox 


To motivate the discussion of combinatorics, let us start with the following problem. Suppose 
there are 50 people in a room. What is the probability that at least one pair of people have 
the same birthday (month and day)? (We exclude Feb. 29 in this problem.) 

The first thing you might be thinking is that since there are 365 days, we need at least 
366 people to ensure that one pair has the same birthday. Therefore, the chance that 2 of 
50 people have the same birthday is low. This seems reasonable, but let’s do a simulated 
experiment. In Figure 1.16 we plot the probability as a function of the number of people. 
For a room containing 50 people, the probability is 97%. To get a 50% probability, we just 
need 23 people! How is this possible? 


[Plot: probability (vertical axis) against the number of people (horizontal axis, 0 to 100).]


Figure 1.16: The probability for two people in a group to have the same birthday as a function of the
number of people in the group.
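
A simulated experiment of this kind can be sketched as follows. The assumptions are the usual ones for this problem and are not taken from the book's own code: birthdays are independent and uniform over 365 days, and the number of trials and the seed are arbitrary. The function name is hypothetical.

```python
import numpy as np

def simulate_shared_birthday(k, trials=20_000, seed=0):
    """Estimate P[at least two of k people share a birthday] by simulation."""
    rng = np.random.default_rng(seed)
    birthdays = rng.integers(0, 365, size=(trials, k))  # days 0..364, uniform
    birthdays.sort(axis=1)
    # After sorting, a repeated day shows up as two equal neighbors.
    has_repeat = np.any(np.diff(birthdays, axis=1) == 0, axis=1)
    return has_repeat.mean()

for k in [10, 23, 50]:
    print(k, simulate_shared_birthday(k))
```

The estimates for 23 and 50 people should land near the 50% and 97% values quoted in the text, up to simulation noise.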


If you think about this problem more deeply, you will probably realize that to solve the 
problem, we must carefully enumerate all the possible configurations. How can we do this? 
Well, suppose you walk into the room and sequentially pick two people. The probability
that they have different birthdays is

$$P[\text{The first 2 people have different birthdays}] = \frac{365}{365} \times \frac{364}{365}.$$

When you ask the first person to tell you their birthday, he or she can occupy any of the
365 slots. This gives us $\frac{365}{365}$. The second person has one slot short because the first person
has taken it, and so the probability that he or she has a different birthday from the first
person is $\frac{364}{365}$. Note that this calculation is independent of how many people you have in the
room because you are picking them sequentially.
If you now choose a third person, the probability that they have different birthdays is

$$P[\text{The first 3 people have different birthdays}] = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365}.$$


This process can be visualized in Figure 1.17. 


[Figure: rows of the days 1 to 365; the first person has 365 choices, and each later person chooses among the days not yet taken.]


Figure 1.17: The probability for two people to have the same birthday as a function of the number of
people in the group. When there is only one person, this person can land on any of the 365 days. When
there are two people, the first person has already taken one day (out of 365 days), so the second person
can only choose 364 days. When there are three people, the first two people have occupied two days,
so there are only 363 days left. If we generalize this process, we see that the number of configurations
is $365 \times 364 \times \cdots \times (365 - k + 1)$, where $k$ is the number of people in the room.


So imagine that you keep going down the list to the 50th person. The probability that
none of these 50 people will have the same birthday is

$$P[\text{The first 50 people have different birthdays}] = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times \cdots \times \frac{316}{365} \approx 0.03.$$

That means that the probability for 50 people to have different birthdays is
as little as 3%. If you take the complement, you can show that with 97% probability, there
is at least one pair of people having the same birthday.
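
The sequential product above is easy to evaluate exactly. The sketch below multiplies the terms one person at a time; the function name is hypothetical, and the only modeling assumption is the one already made in the text (365 equally likely days).

```python
def prob_all_different(k, days=365):
    """P[the first k people have different birthdays], multiplied term by term."""
    p = 1.0
    for i in range(k):
        p *= (days - i) / days   # 365/365, 364/365, ..., (365 - k + 1)/365
    return p

for k in [2, 3, 23, 50]:
    p = prob_all_different(k)
    print(k, p, 1 - p)           # all different, at least one shared pair
```

For 50 people the first number is about 0.03 and its complement about 0.97, in agreement with the values quoted above; for 23 people the complement is just over one half.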
The general equation for this problem is now easy to see:

$$P[\text{The first } k \text{ people have different birthdays}]
= \frac{365 \times 364 \times \cdots \times (365 - k + 1)}{365 \times 365 \times \cdots \times 365}
= \frac{365!}{(365 - k)!} \times \frac{1}{365^k}.$$


The first term in our equation, $\frac{365!}{(365-k)!}$, is called the permutation of picking $k$ days from
365 options. We shall discuss this operation shortly.

Why is the probability so high with only 50 people while it seems that we need 366
people to ensure two identical birthdays? The difference lies in the distinction between
probabilistic and deterministic arguments. The 366-people argument is deterministic. If you have 366 people, you are
certain that two people will have the same birthday. This has no conflict with the
probabilistic argument because the probabilistic argument says that with 50 people, we have a
97% chance of getting two identical birthdays. With a 97% success rate, you still have a
3% chance of failing. It is unlikely to happen, but it can still happen. The more people you
put into the room, the stronger guarantee you will have. However, even if you have 364
people and the probability is almost 100%, there is still no guarantee. So there is no conflict
between the two arguments since they are answering two different questions.

Now, let’s discuss the two combinatorics questions.

1.5.2 Permutation 


Permutation concerns the following question: 


Consider a set of n distinct balls. Suppose we want to pick k balls from the set without 
replacement. How many ordered configurations can we obtain? 


Note that in the above question, the word “ordered” is crucial. For example, the set 
A = {a,b,c} can lead to 6 different ordered configurations 


(a,b,c), (a,c,b), (b,a,c), (b,c,a), (c,a,b), (c,b,a).


As a simple illustration of how to compute the permutation, we can consider a set of 
5 colored balls as shown in Figure 1.18. 


[Figure: five balls at the base (5 choices), then 4 choices, then 3 choices at the stages above.]


Figure 1.18: Permutation. The number of choices is reduced in every stage. Therefore, the total number
is $n \times (n-1) \times \cdots \times (n-k+1)$ if there are $k$ stages.


If you start with the base, which contains five balls, you will have five choices. At one 
level up, since one ball has already been taken, you have only four choices. You continue 
the process until you reach the number of balls you want to collect. The number of
configurations you have generated is the permutation. Here is the formula: 
Theorem 1.6. The number of permutations of choosing $k$ out of $n$ is

$$\frac{n!}{(n-k)!},$$

where $n! = n(n-1)(n-2) \cdots 3 \cdot 2 \cdot 1$.


Proof. Let’s list all possible ways:

Which ball to pick | Number of choices | Why?
The 1st ball       | n                 | No ball has been picked, so we have n choices
The 2nd ball       | n - 1             | The first ball has been picked
The 3rd ball       | n - 2             | The first two balls have been picked
...                | ...               | ...
The kth ball       | n - k + 1         | The first k - 1 balls have been picked
Total:             | n(n - 1) ... (n - k + 1)


The total number of ordered configurations is $n(n-1) \cdots (n-k+1)$. This simplifies
to

$$n(n-1)(n-2) \cdots (n-k+1)
= n(n-1) \cdots (n-k+1) \cdot \frac{(n-k)(n-k-1) \cdots 3 \cdot 2 \cdot 1}{(n-k)(n-k-1) \cdots 3 \cdot 2 \cdot 1}
= \frac{n!}{(n-k)!}.$$


Practice Exercise 1.8. Consider a set of 4 balls {1,2,3,4}. We want to pick two 
balls at random without replacement. The ordering matters. How many permutations 
can we obtain? 


Solution. The possible configurations are (1,2), (2,1), (1,3), (3,1), (1,4), (4,1), (2,3),
(3,2), (2,4), (4,2), (3,4), (4,3). So in total there are 12 configurations. We can also
verify this number by noting that there are 4 balls altogether and so the number
of choices for picking the first ball is 4 and the number of choices for picking the
second ball is $(4 - 1) = 3$. Thus, the total is $4 \cdot 3 = 12$. Referring to the formula, this
result coincides with the theorem, which states that the number of permutations is

$$\frac{4!}{(4-2)!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{2 \cdot 1} = 12.$$
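
The count in Practice Exercise 1.8 can be reproduced by listing the ordered pairs with Python's standard library. This is a small sketch using itertools.permutations for the enumeration and math.perm for the closed-form count (math.perm requires Python 3.8 or later).

```python
import itertools
import math

balls = [1, 2, 3, 4]
k = 2

ordered = list(itertools.permutations(balls, k))   # ordered picks, no replacement
print(ordered)
print(len(ordered))

n = len(balls)
formula = math.factorial(n) // math.factorial(n - k)   # n! / (n - k)!
print(formula, math.perm(n, k))
```

The enumeration contains both (1, 3) and (3, 1), which is exactly what "the ordering matters" means.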
1.5.3 Combination


Another operation in combinatorics is combination. Combination concerns the following 
question: 


Consider a set of n distinct balls. Suppose we want to pick k balls from the set without 
replacement. How many unordered configurations can we obtain? 


Unlike permutation, combination treats a subset of balls with whatever ordering as
one single configuration. For example, the subset (a,b,c) is considered the same as (a,c,b)
or (b,c,a), etc.

Let’s go back to the 5-ball exercise. Suppose you have picked orange, green, and light 
blue. This is the same combination as if you have picked {green, orange, and light blue}, 
or {green, light blue, and orange}. Figure 1.19 lists all the six possible configurations for 
these three balls. So what is combination? Combination needs to take these repeated cases 
into account. 




Figure 1.19: Combination. In this problem, we are interested in picking 3 colored balls out of 5. This 
will give us 5 x 4 x 3 = 60 permutations. However, since we are not interested in the ordering, some of 
the permutations are repeated. For example, there are 6 combos of (green, light blue, orange), which is 
computed from 3 x 2 x 1. Dividing 60 permutations by these 6 choices of the orderings will give us 10 
distinct combinations of the colors. 


Theorem 1.7. The number of combinations of choosing $k$ out of $n$ is

$$\frac{n!}{k!(n-k)!},$$

where $n! = n(n-1)(n-2) \cdots 3 \cdot 2 \cdot 1$.


Proof. We start with the permutation result, which gives us $\frac{n!}{(n-k)!}$ permutations. Note that
every permutation has exactly $k$ balls. However, while these $k$ balls can be arranged in any
order, in combination, we treat them as one single configuration. Therefore, the task is to
count the number of possible orderings for these $k$ balls.

To this end, we note that for a set of $k$ balls, there are in total $k!$ possible ways of
ordering them. The number $k!$ comes from the following table.

Which ball to pick | Number of choices
The 1st ball       | k
The 2nd ball       | k - 1
...                | ...
The kth ball       | 1
Total:             | k(k - 1) ... 3 · 2 · 1

Therefore, the total number of orderings for a set of $k$ balls is $k!$. Since permutation
gives us $\frac{n!}{(n-k)!}$ and every permutation has $k!$ repetitions due to ordering, we divide the
number by $k!$. Thus the number of combinations is

$$\frac{n!}{k!(n-k)!}.$$


Practice Exercise 1.9. Consider a set of 4 balls {1,2,3,4}. We want to pick two 
balls at random without replacement. The ordering does not matter. How many com- 
binations can we obtain? 


Solution. The permutation result gives us 12 permutations. However, among all these
12 permutations, there are only 6 distinct pairs of numbers. We can confirm this by
noting that since we picked 2 balls, there are exactly 2 possible orderings for these 2
balls. Therefore, we have $\frac{12}{2} = 6$ combinations. Using the formula of the
theorem, we check that the number of combinations is

$$\frac{4!}{2!(4-2)!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{(2 \cdot 1)(2 \cdot 1)} = 6.$$
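
The same check works for combinations. The sketch below enumerates the unordered pairs for Practice Exercise 1.9 and then repeats the 5-ball illustration from Figure 1.19; the last two color names are hypothetical placeholders, since the text names only orange, green, and light blue.

```python
import itertools
import math

pairs = list(itertools.combinations([1, 2, 3, 4], 2))   # unordered picks
print(pairs, len(pairs))

# Dividing the permutation count by the k! orderings of each subset.
n, k = 4, 2
print(math.perm(n, k) // math.factorial(k), math.comb(n, k))

# The 5-ball illustration: 60 permutations of 3 balls collapse to 10 combinations.
colors = ["orange", "green", "light blue", "red", "purple"]  # last two names are hypothetical
num_perms = len(list(itertools.permutations(colors, 3)))
num_combs = len(list(itertools.combinations(colors, 3)))
print(num_perms, num_combs)
```

Each group of 3! = 6 permutations that differ only in ordering is counted once as a combination, which is the division by $k!$ in the theorem.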


Example 1.10. (Ross, 8th edition, Section 1.6) Consider the equation

$$x_1 + x_2 + \cdots + x_K = N,$$

where $\{x_k\}$ are positive integers. How many combinations of solutions of this equation
are there?


Solution. We can determine the number of combinations by considering the figure
below. The integer $N$ can be modeled as $N$ balls in an urn. The number of variables $K$
is equivalent to the number of colors of these balls. Since all variables are positive, the
problem can be translated to partitioning the $N$ balls into $K$ buckets. This, in turn,
is the same as inserting $K - 1$ dividers among $N - 1$ holes. Therefore, the number of
combinations is

$$\binom{N-1}{K-1} = \frac{(N-1)!}{(K-1)!(N-K)!}.$$

For example, if $N = 16$ and $K = 4$, then the number of solutions is

$$\binom{16-1}{4-1} = \frac{15!}{3! \, 12!} = 455.$$


Figure 1.20: One possible solution for $N = 16$ and $K = 4$. In general, the problem is equivalent
to inserting $K - 1$ dividers among the $N - 1$ holes between the $N$ balls.
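
The divider argument can be confirmed by brute force for the case $N = 16$, $K = 4$. The sketch below enumerates the first $K - 1$ variables and lets the last one be determined by the equation; the function name is hypothetical, and the approach is only practical for small $N$ and $K$.

```python
import itertools
import math

def count_positive_solutions(N, K):
    """Count (x_1, ..., x_K) of positive integers with x_1 + ... + x_K = N by enumeration."""
    count = 0
    # Enumerate the first K - 1 variables; the last one is then forced.
    for head in itertools.product(range(1, N), repeat=K - 1):
        if N - sum(head) >= 1:
            count += 1
    return count

N, K = 16, 4
print(count_positive_solutions(N, K), math.comb(N - 1, K - 1))
```

Both numbers should equal 455, the value obtained from the formula.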


Closing remark. Permutations and combinations are two ways to enumerate all the pos- 
sible cases. While the conclusions are probabilistic, as the birthday paradox shows, permu- 
tation and combination are deterministic. We do not need to worry about the distribution 
of the samples, and we are not taking averages of anything. Thus, modern data analysis 
seldom uses the concepts of permutation and combination. Accordingly, combinatorics does 
not play a large role in this book. 

Does it mean that combinatorics is not useful? Not quite, because it still provides us 
with powerful tools for theoretical analysis. For example, in binomial random variables, we 
need the concept of combination to calculate the repeated cases. The Poisson random vari- 
able can be regarded as a limiting case of the binomial random variable, and so combination 
is also used. Therefore, while we do not use the concepts of permutation per se, we use them 
to define random variables. 


1.6 Summary 


In this chapter, we have reviewed several background mathematical concepts that will be- 
come useful later in the book. You will find that these concepts are important for under- 
standing the rest of this book. When studying these materials, we recommend not just 
remembering the “recipes” of the steps but focusing on the motivations and intuitions 
behind the techniques. 

We would like to highlight the significance of the birthday paradox. Many of us come 
from an engineering background in which we were told to ensure reliability and guarantee 
success. We want to ensure that the product we deliver to our customers can survive even 
in the worst-case scenario. We tend to apply deterministic arguments such as requiring 366 
people to ensure complete coverage of the 365 days. In modern data analysis, the worst-case 
scenario may not always be relevant because of the complexity of the problem and the cost 
of such a warranty. The probabilistic argument, or the average argument, is more reasonable 
and cost-effective, as you can see from our analysis of the birthday problem. The heart of 
the problem is the trade-off between how much confidence you need versus how much effort 
you need to expend. Suppose an event is unlikely to happen, but if it happens, it will be
a disaster. In that case, you might prefer to be very conservative to ensure that such a
disaster event has a low chance of happening. Industries related to risk management such
as insurance and investment banking are all operating under this principle.


1.8 Problems 


Exercise 1. (VIDEO SOLUTION) 

(a) Show that
$$\sum_{k=0}^{n} r^k = \frac{1 - r^{n+1}}{1 - r}$$
for any $0 < r < 1$. Evaluate $\sum_{k=0}^{\infty} r^k$.


(b) Using the result of (a), evaluate 


$$1 + 2r + 3r^2 + \cdots.$$


(c) Evaluate the sums 


Exercise 2. (VIDEO SOLUTION) 
Recall that 


Evaluate 


Exercise 3. (VIDEO SOLUTION) 
Evaluate the integrals 


(a) 


(b)


Exercise 4.


(a) Compute the result of the following matrix-vector multiplication using NumPy. Submit
your result and code.


(b) Plot a sine function on the interval $[-\pi, \pi]$ with 1000 data points.


(c) Generate 10,000 uniformly distributed random numbers on interval [0, 1). 


Use matplotlib.pyplot.hist to generate a histogram of all the random numbers. 


Exercise 5. 
Calculate 


Exercise 6. 
Let 


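(a) Find $\Sigma^{-1}$, the inverse of $\Sigma$.
(b) Find $|\Sigma|$, the determinant of $\Sigma$.


(c) Simplify the two-dimensional function


$$f(\boldsymbol{x}) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right\}.$$


(d) Using matplotlib.pyplot.contour, plot the function $f(\boldsymbol{x})$ for the range $[-3, 3] \times$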


Exercise 7. 

Out of seven electrical engineering (EE) students and five mechanical engineering (ME) 
students, a committee consisting of three EEs and two MEs is to be formed. In how many 
ways can this be done if 


(a) any of the EEs and any of the MEs can be included? 
(b) one particular EE must be on the committee? 


(c) two particular MEs cannot be on the committee?


Exercise 8.

Five blue balls, three red balls, and three white balls are placed in an urn. Three balls are 
drawn at random without regard to the order in which they are drawn. Using the counting 
approach to probability, find the probability that 


(a) one blue ball, one red ball, and one white ball are drawn. 
(b) all three balls drawn are red. 


(c) exactly two of the balls drawn are blue. 


Exercise 9. 
A collection of 26 English letters, a-z, is mixed in a jar. Two letters are drawn at random, 
one after the other. 


(a) What is the probability of drawing a vowel (a, e, i, o, u) and a consonant in either order?


(b) Write a MATLAB / Python program to verify your answer in part (a). Randomly 
draw two letters without replacement and check whether one is a vowel and the other 
is a consonant. Compute the probability by repeating the experiment 10000 times. 


Exercise 10. 
There are 50 students in a classroom. 


(a) What is the probability that there is at least one pair of students having the same 
birthday? Show your steps. 


(b) Write a MATLAB / Python program to simulate the event and verify your answer 
in (a). Hint: You probably need to repeat the simulation many times to obtain a 
probability. Submit your code and result. 


You may assume that a year only has 365 days. You may also assume that all days have an 
equal likelihood of being taken.


Chapter 2


Probability 


Data and probability are inseparable. Data is the computational side of the story, whereas 
probability is the theoretical side of the story. Any data science practice must be built on 
the foundation of probability, and probability needs to address practical problems. However, 
what exactly is “probability”? Mathematicians have been debating this for centuries. The 
frequentists argue that probability is the relative frequency of an outcome. For example, 
flipping a fair coin has a 1/2 probability of getting a head because if you flip the coin
infinitely many times, you will get a head half of the time. The Bayesians argue
that probability is a subjective belief. For example, the probability of getting an A in a 
class is subjective because no one would want to take a class infinitely many times to obtain 
the relative frequency. Both the frequentists and Bayesians have valid points. However, the 
differentiation is often non-essential because the context of your problem will force you 
to align with one or the other. For example, when you have a shortage of data, then the 
subjectivity of the Bayesians allows you to use prior knowledge, whereas the frequentists 
tell us how to compute the confidence interval of an estimate. 

No matter whether you prefer the frequentist’s view or the Bayesian’s view, there is 
something more fundamental thanks to Andrey Kolmogorov (1903-1987). The development 
of this fundamental definition will take some effort on our part, but if we distill the essence, 
we can summarize it as follows: 


Probability is a measure of the size of a set. 


This sentence is not a formal definition; instead, it summarizes what we believe to be the 
essence of probability. We need to clarify some puzzles later in this chapter, but if you can 
understand what this sentence means, you are halfway done with this book. To spell out the 
details, we will describe an elementary problem that everyone knows how to solve. As we 
discuss this problem, we will highlight a few key concepts that will give you some intuitive 
insights into our definition of probability, after which we will explain the sequence of topics 
to be covered in this chapter. 


Prelude: Probability of throwing a die 


Suppose that you have a fair die. It has 6 faces: $\{1, 2, 3, 4, 5, 6\}$. What is the probability that
you get a number that is “less than 5” and is “an even number”? This is a straightforward
problem. You probably have already found the answer, which is $\frac{2}{6}$ because “less than 5” and
“an even number” means $\{2, 4\}$. However, let’s go through the thinking process slowly by
explicitly writing down the steps.

First of all, how do we know that the denominator in $\frac{2}{6}$ is 6? Well, because there are
six faces. These six faces form a set called the sample space. A sample space is the set
containing all possible outcomes, which in our case is $\Omega = \{1, 2, 3, 4, 5, 6\}$. The denominator
6 is the size of the sample space.

How do we know that the numerator is 2? Again, implicitly in our minds, we have
constructed two events: $E_1$ = “less than 5” = $\{1, 2, 3, 4\}$, and $E_2$ = “an even number”
= $\{2, 4, 6\}$. Then we take the intersection between these two events to conclude the event
$E = \{2, 4\}$. The numerical value “2” is the size of this event $E$.

So, when we say that “the probability is $\frac{2}{6}$,” we are saying that the size of the event
$E$ relative to the sample space $\Omega$ is the ratio $\frac{2}{6}$. This process involves measuring the size
of $E$ and $\Omega$. In this particular example, the measure we use is a “counter” that counts the
number of elements.

This example shows us all the necessary components of probability: (i) There is a
sample space, which is the set that contains all the possible outcomes. (ii) There is an event,
which is a subset inside the sample space. (iii) Two events $E_1$ and $E_2$ can be combined to
construct another event $E$ that is still a subset inside the sample space. (iv) Probability is
a number assigned by certain rules such that it describes the relative size of the event $E$
compared with the sample space $\Omega$. So, when we say that probability is a measure of the
size of a set, we create a mapping that takes in a set and outputs the size of that set.
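
The steps of the prelude can be mirrored with Python's built-in `set` type. This is only a sketch of the counting measure for equally likely outcomes; the helper name `prob` is hypothetical, not a notation used in this book.

```python
from fractions import Fraction

# Sample space of a fair die.
omega = {1, 2, 3, 4, 5, 6}

# The two events described in words, written as subsets of omega.
E1 = {x for x in omega if x < 5}        # "less than 5"
E2 = {x for x in omega if x % 2 == 0}   # "an even number"

# Combining the events: "and" corresponds to set intersection.
E = E1 & E2


def prob(event, sample_space):
    """Counting measure: size of the event relative to the sample space.

    Assumes every outcome in the sample space is equally likely.
    """
    return Fraction(len(event), len(sample_space))


print(E, prob(E, omega))
```

`Fraction` reduces $\frac{2}{6}$ to $\frac{1}{3}$ automatically; the two are the same number.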


Organization of this chapter 


As you can see from this example, since probability is a measure of the size of a set, we need 
to understand the operations of sets to understand probability. Accordingly, in Section 2.1 
we first define sets and discuss their operations. After learning these basic concepts, we move 
on to define the sample space and event space in Section 2.2. There, we discuss sample spaces 
that are not necessarily countable and how probabilities are assigned to events. Of course, 
assigning a probability value to an event cannot be arbitrary; otherwise, the probabilities 
may be inconsistent. Consequently, in Section 2.3 we introduce the probability axioms and 
formalize the notion of measure. Section 2.4 consists of a trio of topics that concern the 
relationship between events using conditioning. We discuss conditional probability in Section 
2.4.1, independence in Section 2.4.2, and Bayes’ theorem in Section 2.4.3. 


2.1 Set Theory 


2.1.1 Why study set theory? 


In mathematics, we are often interested in describing a collection of numbers, for example, a 
positive interval [a, b] on the real line or the ordered pairs of numbers that define a circle on 
a graph with two axes. These collections of numbers can be abstractly defined as sets. In a 
nutshell, a set is simply a collection of things. These things can be numbers, but they can also
be letters, objects, or anything. Set theory is a mathematical tool that defines operations
on sets. It provides the basic arithmetic for us to combine, separate, and decompose sets.

Why do we start the chapter by describing set theory? Because probability is a measure
of the size of a set. Yes, probability is not just a number telling us the relative frequency
of events; it is an operator that takes a set and tells us how large the set is. Using the
example we showed in the prelude, the event “even number” of a die is a set containing
numbers $\{2, 4, 6\}$. When we apply probability to this set, we obtain the number $\frac{3}{6}$, as shown
in Figure 2.1. Thus sets are the foundation of the study of probability.


[Figure 2.1 diagram: $\mathbb{P}$ (a measure) is applied to a set and returns a number between 0 and 1.]


Figure 2.1: Probability is a measure of the size of a set. Whenever we talk about probability, it has to
be the probability of a set.


2.1.2 Basic concepts of a set 


Definition 2.1 (Set). A set is a collection of elements. We denote 


$$A = \{\xi_1, \xi_2, \ldots, \xi_n\}$$


as a set, where $\xi_i$ is the $i$th element in the set.


In this definition, $A$ is called a set. It is nothing but a collection of elements $\xi_1, \ldots, \xi_n$. What
are these $\xi_i$’s? They can be anything. Let’s see a few examples below.


Example 2.1(a). A = {apple, orange, pear} is a finite set. 
Example 2.1(b). A = {1,2,3,4,5,6} is a finite set. 

Example 2.1(c). A = {2,4,6,8,...} is a countable but infinite set. 
Example 2.1(d). $A = \{x \mid 0 < x < 1\}$ is an uncountable set.


To say that an element $\xi$ is drawn from $A$, we write $\xi \in A$. For example, the number 1
is an element in the set $\{1, 2, 3\}$. We write $1 \in \{1, 2, 3\}$. There are a few common sets that
we will encounter. For example, 


Example 2.2(a). $\mathbb{R}$ is the set of all real numbers, ranging over $(-\infty, \infty)$.


Example 2.2(b). $\mathbb{R}^2$ is the set of ordered pairs of real numbers.
Example 2.2(c). $[a, b] = \{x \mid a \le x \le b\}$ is a closed interval on $\mathbb{R}$.
Example 2.2(d). $(a, b) = \{x \mid a < x < b\}$ is an open interval on $\mathbb{R}$.


Example 2.2(e). $(a, b] = \{x \mid a < x \le b\}$ is a semi-closed interval on $\mathbb{R}$.

[Figure 2.2 diagram: three intervals on the real line, labeled $a \le x \le b$, $a < x \le b$, and $a < x < b$.]


Figure 2.2: From left to right: a closed interval, a semi-closed (or semi-open) interval, and an open
interval.


Sets are not limited to numbers. A set can be used to describe a collection of functions. 


Example 2.3. $A = \{f : \mathbb{R} \to \mathbb{R} \mid f(x) = ax + b,\ a, b \in \mathbb{R}\}$. This is the set of all straight
lines in 2D. The notation $f : \mathbb{R} \to \mathbb{R}$ means that the function $f$ takes an argument
from $\mathbb{R}$ and sends it to another real number in $\mathbb{R}$. The definition $f(x) = ax + b$ says
that $f$ is taking the specific form of $ax + b$. Since the constants $a$ and $b$ can be any
real number, the equation $f(x) = ax + b$ enumerates all possible straight lines in 2D.
See Figure 2.3(a).


Example 2.4. $A = \{f : \mathbb{R} \to [-1, 1] \mid f(t) = \cos(\omega_0 t + \theta),\ \theta \in [0, 2\pi]\}$. This is
the set of all cosine functions of a fixed carrier frequency $\omega_0$. The phase $\theta$, however,
is changing. Therefore, the equation $f(t) = \cos(\omega_0 t + \theta)$ says that the set $A$ is the
collection of all possible cosines with different phases. See Figure 2.3(b).


Figure 2.3: (a) The set of straight lines $A = \{f : \mathbb{R} \to \mathbb{R} \mid f(x) = ax + b,\ a, b \in \mathbb{R}\}$. (b) The set of
phase-shifted cosines $A = \{f : \mathbb{R} \to [-1, 1] \mid f(t) = \cos(\omega_0 t + \theta),\ \theta \in [0, 2\pi]\}$.


A set can also be used to describe a collection of sets. Let A and B be two sets. Then 
C = {A, B} is a set of sets. 


Example 2.5. Let A = {1,2} and B = {apple, orange}. Then 


C = {A, B} = {{1, 2}, {apple, orange}} 


is a collection of sets. Note that here we are not saying C is the union of two sets. We 
are only saying that $C$ is a collection of two sets. See the next example.


Example 2.6. Let $A = \{1, 2\}$ and $B = \{3\}$, then $C = \{A, B\}$ means that


$$C = \{\{1, 2\}, \{3\}\}.$$


Therefore $C$ contains only two elements. One is the set $\{1, 2\}$ and the other is the set
$\{3\}$. Note that $\{\{1, 2\}, \{3\}\} \ne \{1, 2, 3\}$. The former is a set of two sets. The latter is a
set of three elements.
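
The distinction between a set of sets and a union can be seen concretely in Python. One assumption of this sketch: Python's mutable `set` cannot be an element of another set, so the inner sets are written as `frozenset`, which is a language detail and not part of the mathematics.

```python
# Inner sets must be hashable to be elements of another set, hence frozenset.
A = frozenset({1, 2})
B = frozenset({3})

# C is a set of sets: its elements are A and B themselves.
C = {A, B}

# The union merges the elements of A and B into one set.
union = A | B

print(len(C))      # number of elements in the set of sets
print(len(union))  # number of elements in the union
print(C == union)  # the two objects are different sets
```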


2.1.3 Subsets


Given a set, we often want to specify a portion of the set, which is called a subset. 


Definition 2.2 (Subset). $B$ is a subset of $A$ if for any $\xi \in B$, $\xi$ is also in $A$. We
write


$$B \subseteq A \tag{2.2}$$
to denote that $B$ is a subset of $A$.


$B$ is called a proper subset of $A$ if $B$ is a subset of $A$ and $B \ne A$. We denote a proper subset
as $B \subset A$. Two sets $A$ and $B$ are equal if and only if $A \subseteq B$ and $B \subseteq A$.


Example 2.7. 


• If $A = \{1, 2, 3, 4, 5, 6\}$, then $B = \{1, 3, 5\}$ is a proper subset of $A$.
• If $A = \{1, 2\}$, then $B = \{1, 2\}$ is an improper subset of $A$.
• If $A = \{t \mid t \ge 0\}$, then $B = \{t \mid t > 0\}$ is a proper subset of $A$.


Practice Exercise 2.1. Let $A = \{1, 2, 3\}$. List all the subsets of $A$.
Solution. The subsets of $A$ are:


$$\mathcal{A} = \{\emptyset, \{1\}, \{2\}, \{3\}, \{1, 2\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}\}.$$
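
For larger sets, listing the subsets by hand becomes tedious. A minimal Python sketch that enumerates them is given below; the function name `all_subsets` is a hypothetical helper, and the enumeration relies on `itertools.combinations` from the standard library.

```python
from itertools import combinations


def all_subsets(A):
    """Return every subset of A, from the empty set up to A itself."""
    items = sorted(A)
    return [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]


subsets = all_subsets({1, 2, 3})
for s in subsets:
    print(s)
print(len(subsets))
```

A set with $n$ elements has $2^n$ subsets, so the list has $2^3 = 8$ entries here.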


Practice Exercise 2.2. Prove that two sets $A$ and $B$ are equal if and only if $A \subseteq B$
and $B \subseteq A$.


Solution. Suppose $A \subseteq B$ and $B \subseteq A$. Assume by contradiction that $A \ne B$. Then
necessarily there must exist an $x$ such that $x \in A$ but $x \notin B$ (or vice versa). But
$A \subseteq B$ means that $x \in A$ will necessarily be in $B$. So it is impossible to have $x \notin B$.
Conversely, suppose that $A = B$. Then any $x \in A$ will necessarily be in $B$. Therefore,
we have $A \subseteq B$. Similarly, if $A = B$ then any $x \in B$ will be in $A$, and so $B \subseteq A$.


2.1.4 Empty set and universal set


Definition 2.3 (Empty Set). A set is empty if it contains no element. We denote
an empty set as


$$A = \emptyset. \tag{2.3}$$


A set containing an element 0 is not an empty set. It is a set of one element, $\{0\}$. The
number of elements of the empty set is 0. The empty set is a subset of any set, i.e., $\emptyset \subseteq A$
for any $A$. We use $\subseteq$ because $A$ could also be an empty set.


Example 2.8(a). The set $A = \{x \mid \sin x > 1\}$ is empty because no $x \in \mathbb{R}$ can make
$\sin x > 1$.


Example 2.8(b). The set $A = \{x \mid x > 5 \text{ and } x < 1\}$ is empty because the two
conditions $x > 5$ and $x < 1$ are contradictory.


Definition 2.4 (Universal Set). The universal set is the set containing all elements
under consideration. We denote a universal set as


$$A = \Omega. \tag{2.4}$$


The universal set $\Omega$ is a subset of itself, i.e., $\Omega \subseteq \Omega$. The universal set is a relative concept.
Usually, we first define a universal set $\Omega$ before referring to subsets of $\Omega$. For example, we
can define $\Omega = \mathbb{R}$ and refer to intervals in $\mathbb{R}$. We can also define $\Omega = [0, 1]$ and refer to
subintervals inside $[0, 1]$.


2.1.5 Union 


We now discuss basic set operations. By operations, we mean functions of two or more sets 
whose output value is a set. We use these operations to combine and separate sets. Let us 
first consider the union of two sets. See Figure 2.4 for a graphical depiction.


Definition 2.5 (Finite Union). The union of two sets $A$ and $B$ contains all elements
in $A$ or in $B$. That is,
$$A \cup B = \{\xi \mid \xi \in A \text{ or } \xi \in B\}. \tag{2.5}$$


As the definition suggests, the union of two sets connects the sets using the logical operator
“or”. Therefore, the union of two sets is always larger than or equal to the individual sets.


Example 2.9(a). If $A = \{1, 2\}$, $B = \{1, 5\}$, then $A \cup B = \{1, 2, 5\}$. The overlapping
element 1 is absorbed. Also, note that $A \cup B \ne \{\{1, 2\}, \{1, 5\}\}$. The latter is a set of
sets.


Example 2.9(b). If $A = (3, 4]$, $B = (3.5, \infty)$, then $A \cup B = (3, \infty)$.


Example 2.9(c). If $A = \{f : \mathbb{R} \to \mathbb{R} \mid f(x) = ax\}$ and $B = \{f : \mathbb{R} \to \mathbb{R} \mid f(x) = b\}$,
then $A \cup B$ = a set of sloped lines with a slope $a$ plus a set of constant lines with
height $b$. Note that $A \cup B \ne \{f : \mathbb{R} \to \mathbb{R} \mid f(x) = ax + b\}$ because the latter is a set of
sloped lines with arbitrary y-intercept.


Example 2.9(d). If $A = \{1, 2\}$ and $B = \emptyset$, then $A \cup B = \{1, 2\}$.
Example 2.9(e). If $A = \{1, 2\}$ and $B = \Omega$, then $A \cup B = \Omega$.


[Figure 2.4 diagram: Venn diagram of two sets $A$ and $B$ and their union $A \cup B$.]


Figure 2.4: The union of two sets contains elements that are either in $A$ or $B$ or both.


The previous example can be generalized in the following exercise. What it says is that 
if A is a subset of another set B, then the union of A and B is just B. Intuitively, this should 
be straightforward because whatever you have in A is already in B, so the union will just 
be B. Below is a formal proof that illustrates how to state the arguments clearly. You may 
like to draw a picture to convince yourself that the proof is correct. 


Practice Exercise 2.3: Prove that if $A \subseteq B$, then $A \cup B = B$.


Solution: We will show that $A \cup B \subseteq B$ and $B \subseteq A \cup B$. Let $\xi \in A \cup B$. Then $\xi$ must
be inside either $A$ or $B$ (or both). In any case, since we know that $A \subseteq B$, it holds
that if $\xi \in A$ then $\xi$ must also be in $B$. Therefore, for any $\xi \in A \cup B$ we have $\xi \in B$.
This shows $A \cup B \subseteq B$. Conversely, if $\xi \in B$, then $\xi$ must be inside $A \cup B$ because
the union contains every element of $B$. So if $\xi \in B$ then $\xi \in A \cup B$ and hence $B \subseteq A \cup B$. Since
$A \cup B$ is a subset of $B$ or equal to $B$, and $B$ is a subset of $A \cup B$ or equal to $A \cup B$, it
follows that $A \cup B = B$.
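
The statement can also be checked on concrete finite sets. In the Python sketch below, `|` is the built-in set union and `<=` is the built-in subset test; the helper name `union_equals_superset` and the sets used are illustrative assumptions.

```python
def union_equals_superset(A, B):
    """Return True when A ∪ B equals B, as expected whenever A is a subset of B."""
    return (A | B) == B


A = {1, 2}
B = {1, 2, 5}
print(A <= B, union_equals_superset(A, B))

# When the first set is not a subset of B, the union is strictly larger than B.
C = {1, 7}
print(C <= B, union_equals_superset(C, B))
```

A check on examples is not a proof, but it is a quick way to catch a misstated claim before attempting one.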


What should we do if we want to take the union of an infinite number of sets? First, 
we need to define the concept of an infinite union. 


Definition 2.6 (Infinite Union). For an infinite sequence of sets $A_1, A_2, \ldots$, the
infinite union is defined as


$$\bigcup_{n=1}^{\infty} A_n = \{\xi \mid \xi \in A_n \text{ for at least one } n \text{ that is finite}\}. \tag{2.6}$$


An infinite union is a natural extension of a finite union. It is not difficult to see that


$$\xi \in A \text{ or } \xi \in B \iff \xi \text{ is in at least one of } A \text{ and } B.$$

Similarly, an infinite union means that

$$\xi \in A_1 \text{ or } \xi \in A_2 \text{ or } \xi \in A_3 \ldots \iff \xi \text{ is in at least one of } A_1, A_2, A_3, \ldots.$$


The finite $n$ requirement says that we only evaluate the sets for a finite number of $n$’s. This
$n$ can be arbitrarily large, but it is finite. Why are we able to do this? Because the concept
of an infinite union is to determine $A_\infty$, which is the limit of a sequence. Like any sequence
of real numbers, the limit of a sequence of sets has to be defined by evaluating the instances
of all possible finite cases.


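Consider a sequence of sets $A_n = \left[-1, 1 - \frac{1}{n}\right]$, for $n = 1, 2, \ldots$. For example, $A_1 =
[-1, 0]$, $A_2 = \left[-1, \frac{1}{2}\right]$, $A_3 = \left[-1, \frac{2}{3}\right]$, $A_4 = \left[-1, \frac{3}{4}\right]$, etc.


[Figure 2.5 diagram: the intervals $A_n$ drawn on the real line starting at $-1$, with right endpoints such as $1 - \frac{1}{10}$ and $1 - \frac{1}{100}$ approaching 1.]


Figure 2.5: The infinite union $\bigcup_{n=1}^{\infty} \left[-1, 1 - \frac{1}{n}\right]$. No matter how large $n$ gets, the point 1 is never
included. So the infinite union is $[-1, 1)$.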


To take the infinite union, we know that the set $[-1, 1)$ is always included, because the
right-hand limit $1 - \frac{1}{n}$ approaches 1 as $n$ approaches $\infty$. So the only question concerns the
number 1. Should 1 be included? According to the definition above, we ask: Is 1 an element
of at least one of the sets $A_1, A_2, \ldots, A_n$? Clearly it is not: $1 \notin A_1$, $1 \notin A_2$, .... In fact,
$1 \notin A_n$ for any finite $n$. Therefore 1 is not an element of the infinite union, and we conclude
that

$$\bigcup_{n=1}^{\infty} A_n = \bigcup_{n=1}^{\infty} \left[-1, 1 - \frac{1}{n}\right] = [-1, 1).$$


Practice Exercise 2.4. Find the infinite union of the sequences where (a) $A_n =
\left[-1, 1 - \frac{1}{n}\right)$, (b) $A_n = \left(-1, 1 - \frac{1}{n}\right]$.


Solution. (a) $\bigcup_{n=1}^{\infty} A_n = [-1, 1)$. (b) $\bigcup_{n=1}^{\infty} A_n = (-1, 1)$.
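
The "at least one finite $n$" requirement can be explored numerically for the sequence $A_n = \left[-1, 1 - \frac{1}{n}\right]$ discussed above. The Python sketch below searches for the first $n$ whose set contains a given point. The function names and the search limit `n_max` are assumptions made for illustration; a finite search can only suggest, not prove, that a point such as 1 is excluded.

```python
from fractions import Fraction


def in_A(x, n):
    """Membership test for A_n = [-1, 1 - 1/n], using exact rational arithmetic."""
    return -1 <= x <= 1 - Fraction(1, n)


def first_n_containing(x, n_max=10_000):
    """Smallest n <= n_max with x in A_n, or None if no such n is found."""
    for n in range(1, n_max + 1):
        if in_A(x, n):
            return n
    return None


# A point just below 1 is captured once n is large enough.
print(first_n_containing(Fraction(999, 1000)))

# The point 1 is not captured by any A_n within the search limit.
print(first_n_containing(Fraction(1)))
```

Points closer to 1 require a larger $n$, but some finite $n$ always works; for the point 1 itself, no finite $n$ works because $1 - \frac{1}{n} < 1$ for every $n$.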


2.1.6 Intersection 


The union of two sets is based on the logical operator or. If we use the logical operator and, 
then the result is the intersection of two sets. 


Definition 2.7 (Finite Intersection). The intersection of two sets $A$ and $B$ contains
all elements in $A$ and in $B$. That is,


$$A \cap B = \{\xi \mid \xi \in A \text{ and } \xi \in B\}. \tag{2.7}$$


Figure 2.6 portrays intersection graphically. Intersection finds the common elements of the
two sets. It is not difficult to show that $A \cap B \subseteq A$ and $A \cap B \subseteq B$.

[Figure 2.6 diagram: Venn diagram of two sets $A$ and $B$ and their intersection $A \cap B$.]


Figure 2.6: The intersection of two sets contains elements in both $A$ and $B$.


Example 2.10(a). If $A = \{1, 2, 3, 4\}$, $B = \{1, 5, 6\}$, then $A \cap B = \{1\}$.
Example 2.10(b). If $A = \{1, 2\}$, $B = \{5, 6\}$, then $A \cap B = \emptyset$.
Example 2.10(c). If $A = (3, 4]$, $B = [3.5, \infty)$, then $A \cap B = [3.5, 4]$.
Example 2.10(d). If $A = (3, 4)$, $B = \emptyset$, then $A \cap B = \emptyset$.

Example 2.10(e). If $A = (3, 4]$, $B = \Omega$, then $A \cap B = (3, 4]$.


Example 2.11. If $A = \{f : \mathbb{R} \to \mathbb{R} \mid f(x) = ax\}$ and $B = \{f : \mathbb{R} \to \mathbb{R} \mid f(x) = b\}$, then
$A \cap B$ = the intersection of a set of sloped lines with a slope $a$ and a set of constant lines
with height $b$. The only line that can satisfy both sets is the line $f(x) = 0$. Therefore,
$A \cap B = \{f \mid f(x) = 0\}$.


Example 2.12. If $A = \{\{1\}, \{2\}\}$ and $B = \{\{2, 3\}, \{4\}\}$, then $A \cap B = \emptyset$. This is
because $A$ is a set containing two sets, and $B$ is a set containing two sets. The two sets
$\{2\}$ and $\{2, 3\}$ are not the same. Thus, $A$ and $B$ have no elements in common, and so
$A \cap B = \emptyset$.
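
The finite examples above translate directly into Python, where `&` is the built-in set intersection. As before, representing the inner sets of Example 2.12 with `frozenset` is a language requirement assumed by this sketch, not part of the mathematics.

```python
# Example 2.10(a) and (b): intersections of finite sets of numbers.
print({1, 2, 3, 4} & {1, 5, 6})
print({1, 2} & {5, 6})

# Example 2.12: sets whose elements are themselves sets.
A = {frozenset({1}), frozenset({2})}
B = {frozenset({2, 3}), frozenset({4})}

# {2} and {2, 3} are different elements, so nothing is shared.
print(A & B)
```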


Similarly to the infinite union, we can define the concept of infinite intersection. 


Definition 2.8 (Infinite Intersection). For an infinite sequence of sets $A_1, A_2, \ldots$,
the infinite intersection is defined as


$$\bigcap_{n=1}^{\infty} A_n = \{\xi \mid \xi \in A_n \text{ for every finite } n\}. \tag{2.8}$$


To understand this definition, we note that
$$\xi \in A \text{ and } \xi \in B \iff \xi \text{ is in every one of } A \text{ and } B.$$
As a result, it follows that


$$\xi \in A_1 \text{ and } \xi \in A_2 \text{ and } \xi \in A_3 \ldots \iff \xi \text{ is in every one of } A_1, A_2, A_3, \ldots.$$

Since the infinite intersection requires that $\xi$ is in every one of $A_1, A_2, \ldots, A_n$, if there is a
set $A_i$ that does not contain $\xi$, then $\xi$ is not an element of the infinite intersection.
Consider the problem of finding the infinite intersection $\bigcap_{n=1}^{\infty} A_n$, where


$$A_n = \left[0, 1 + \frac{1}{n}\right).$$

We note that the sequence of sets is $[0, 2)$, $[0, 1.5)$, $[0, 1.33)$, .... As $n \to \infty$, we note that
the limit is either $[0, 1)$ or $[0, 1]$. Should the right-hand limit 1 be included in the infinite
intersection? According to the definition above, we know that $1 \in A_1$, $1 \in A_2$, ..., $1 \in A_n$
for any finite $n$. Therefore, 1 is included and so

$$\bigcap_{n=1}^{\infty} A_n = \bigcap_{n=1}^{\infty} \left[0, 1 + \frac{1}{n}\right) = [0, 1].$$


Figure 2.7: The infinite intersection $\bigcap_{n=1}^{\infty} \left[0, 1 + \frac{1}{n}\right)$. No matter how large $n$ gets, the point 1 is
always included. So the infinite intersection is $[0, 1]$.


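Practice Exercise 2.5. Find the infinite intersection of the sequences where (a)
$A_n = \left[0, 1 + \frac{1}{n}\right]$, (b) $A_n = \left(0, 1 + \frac{1}{n}\right)$, (c) $A_n = \left[0, 1 - \frac{1}{n}\right)$, (d) $A_n = \left[0, 1 - \frac{1}{n}\right]$.


Solution. (a) $\bigcap_{n=1}^{\infty} A_n = [0, 1]$. (b) $\bigcap_{n=1}^{\infty} A_n = (0, 1]$. (c) $\bigcap_{n=1}^{\infty} A_n = \emptyset$, because $A_1 = [0, 0) = \emptyset$. (d) $\bigcap_{n=1}^{\infty} A_n = \{0\}$, because $A_1 = [0, 0] = \{0\}$ and $0 \in A_n$ for every $n$.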
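
The "for every finite $n$" requirement can be probed in the same spirit for the sequence $A_n = \left[0, 1 + \frac{1}{n}\right)$ considered above. The Python sketch below looks for the first $n$ whose set excludes a given point. The function names and the search limit `n_max` are illustrative assumptions, and a finite search can only suggest that a point survives every $A_n$.

```python
from fractions import Fraction


def in_A(x, n):
    """Membership test for A_n = [0, 1 + 1/n), using exact rational arithmetic."""
    return 0 <= x < 1 + Fraction(1, n)


def first_n_excluding(x, n_max=10_000):
    """Smallest n <= n_max with x NOT in A_n, or None if x is in every A_n checked."""
    for n in range(1, n_max + 1):
        if not in_A(x, n):
            return n
    return None


# The point 1 is inside every A_n that is checked.
print(first_n_excluding(Fraction(1)))

# A point slightly above 1 is eventually excluded, so it is not in the intersection.
print(first_n_excluding(Fraction(1001, 1000)))
```

Any point strictly greater than 1 is excluded once $n$ is large enough, which is why the intersection stops at 1 but still contains it.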


2.1.7 Complement and difference 


Besides union and intersection, there is a third basic operation on sets known as the com- 
plement. 


Definition 2.9 (Complement). The complement of a set $A$ is the set containing all
elements that are in $\Omega$ but not in $A$. That is,


$$A^c = \{\xi \mid \xi \in \Omega \text{ and } \xi \notin A\}. \tag{2.9}$$


Figure 2.8 graphically portrays the idea of a complement. The complement is a set that
contains everything in the universal set that is not in $A$. Thus the complement of a set is
always relative to a specified universal set.

[Figure 2.8 diagram: two Venn diagrams, one for the complement of $A$ and one for the difference of $A$ and $B$.]


Figure 2.8: [Left] The complement of a set $A$ contains all elements that are not in $A$. [Right] The
difference $A \setminus B$ contains elements that are in $A$ but not in $B$.


Example 2.13(a). Let $A = \{1, 2, 3\}$ and $\Omega = \{1, 2, 3, 4, 5, 6\}$. Then $A^c = \{4, 5, 6\}$.
Example 2.13(b). Let $A = \{\text{even integers}\}$ and $\Omega = \{\text{integers}\}$. Then $A^c = \{\text{odd
integers}\}$.

Example 2.13(c). Let $A = \{\text{integers}\}$ and $\Omega = \mathbb{R}$. Then $A^c = \{\text{any real number that
is not an integer}\}$.

Example 2.13(d). Let $A = [0, 5)$ and $\Omega = \mathbb{R}$. Then $A^c = (-\infty, 0) \cup [5, \infty)$.
Example 2.13(e). Let $A = \mathbb{R}$ and $\Omega = \mathbb{R}$. Then $A^c = \emptyset$.


The concept of the complement will help us understand the concept of difference. 


Definition 2.10 (Difference). The difference $A \setminus B$ is the set containing all elements
in $A$ but not in $B$.
$$A \setminus B = \{\xi \mid \xi \in A \text{ and } \xi \notin B\}. \tag{2.10}$$


Figure 2.8 portrays the concept of difference graphically. Note that, in general, $A \setminus B \ne B \setminus A$. The former
removes the elements in $B$ whereas the latter removes the elements in $A$.


Example 2.14(a). Let $A = \{1, 3, 5, 6\}$ and $B = \{2, 3, 4\}$. Then $A \setminus B = \{1, 5, 6\}$ and
$B \setminus A = \{2, 4\}$.


Example 2.14(b). Let $A = [0, 1]$, $B = [2, 3]$, then $A \setminus B = [0, 1]$, and $B \setminus A = [2, 3]$.
This example shows that if the two sets do not overlap, there is nothing to subtract.


Example 2.14(c). Let $A = [0, 1]$, $B = \mathbb{R}$, then $A \setminus B = \emptyset$, and $B \setminus A = (-\infty, 0) \cup (1, \infty)$.
This example shows that if one of the sets is the universal set, then the difference will
either return the empty set or the complement.

[Figure 2.9 diagram: two Venn diagrams of sets $A$ and $B$.]


Figure 2.9: [Left] $A$ and $B$ are overlapping. [Right] $A$ and $B$ are disjoint.


Practice Exercise 2.6. Show that for any two sets $A$ and $B$, the differences $A \setminus B$
and $B \setminus A$ never overlap, i.e., $(A \setminus B) \cap (B \setminus A) = \emptyset$.


Solution. Suppose, by contradiction, that the intersection is not empty so that there
exists a $\xi \in (A \setminus B) \cap (B \setminus A)$. Then, by the definition of intersection, $\xi$ is an element
of $(A \setminus B)$ and $(B \setminus A)$. But if $\xi$ is an element of $(A \setminus B)$, it cannot be an element of $B$.
This implies that $\xi$ cannot be an element of $(B \setminus A)$ since it is a subset of $B$. This is a
contradiction because we just assumed that $\xi$ can live in both $(A \setminus B)$ and $(B \setminus A)$.


Difference can be defined in terms of intersection and complement: 


Theorem 2.1. Let $A$ and $B$ be two sets. Then


$$A \setminus B = A \cap B^c.$$


Proof. Let $x \in A \setminus B$. Then $x \in A$ and $x \notin B$. Since $x \notin B$, we have $x \in B^c$. Therefore,
$x \in A$ and $x \in B^c$. By the definition of intersection, we have $x \in A \cap B^c$. This shows
that $A \setminus B \subseteq A \cap B^c$. Conversely, let $x \in A \cap B^c$. Then, $x \in A$ and $x \in B^c$, which implies
that $x \in A$ and $x \notin B$. By the definition of $A \setminus B$, we have that $x \in A \setminus B$. This shows that
$A \cap B^c \subseteq A \setminus B$.
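
Complement, difference, and the identity $A \setminus B = A \cap B^c$ can be checked on finite sets in Python, where `-` is the built-in set difference. The helper name `complement` and the randomly generated test sets are assumptions of this sketch; the complement is always taken relative to an explicitly supplied universal set.

```python
import random


def complement(A, omega):
    """Complement of A relative to the universal set omega."""
    return omega - A


omega = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5, 6}
B = {2, 3, 4}

print(A - B, B - A)               # the two differences are not the same set
print(complement({1, 2, 3}, omega))

# Check A \ B == A ∩ B^c on many random subsets of a small universe.
random.seed(0)
universe = set(range(20))
identity_holds = True
for _ in range(1000):
    X = {x for x in universe if random.random() < 0.5}
    Y = {x for x in universe if random.random() < 0.5}
    if (X - Y) != (X & complement(Y, universe)):
        identity_holds = False
print(identity_holds)
```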


2.1.8 Disjoint and partition 


It is important to be able to quantify situations in which two sets are not overlapping. In 
this situation, we say that the sets are disjoint. 


Definition 2.11 (Disjoint). Two sets $A$ and $B$ are disjoint if
$$A \cap B = \emptyset. \tag{2.12}$$


For a collection of sets $\{A_1, A_2, \ldots, A_n\}$, we say that the collection is disjoint if, for
any pair $i \ne j$,


$$A_i \cap A_j = \emptyset. \tag{2.13}$$


A pictorial interpretation can be found in Figure 2.9.

Example 2.15(a). Let $A = \{x > 1\}$ and $B = \{x < 0\}$. Then $A$ and $B$ are disjoint.


Example 2.15(b). Let $A = \{1, 2, 3\}$ and $B = \emptyset$. Then $A$ and $B$ are disjoint.
Example 2.15(c). Let $A = (0, 1)$ and $B = [1, 2)$. Then $A$ and $B$ are disjoint.


With the definition of disjoint, we can now define the powerful concept of partition. 


Definition 2.12 (Partition). A collection of sets $\{A_1, \ldots, A_n\}$ is a partition of the
universal set $\Omega$ if it satisfies the following conditions:


• (non-overlap) $\{A_1, \ldots, A_n\}$ is disjoint:


$$A_i \cap A_j = \emptyset \quad \text{for any } i \ne j.$$


• (decompose) Union of $\{A_1, \ldots, A_n\}$ gives the universal set:


$$\bigcup_{i=1}^{n} A_i = \Omega.$$


In plain language, a partition is a collection of non-overlapping subsets whose union is
the universal set. Partition is important because it is a decomposition of $\Omega$ into smaller
subsets, and since these subsets do not overlap, they can be analyzed separately. Partition
is a handy tool for studying probability because it allows us to decouple complex events by
treating them as isolated sub-events.


Figure 2.10: A partition of $\Omega$ contains disjoint subsets of which the union gives us $\Omega$.


Example 2.16. Let $\Omega = \{1, 2, 3, 4, 5, 6\}$. The following sets form a partition:


$$A_1 = \{1, 2, 3\}, \quad A_2 = \{4, 5\}, \quad A_3 = \{6\}.$$


Example 2.17. Let $\Omega = \{1, 2, 3, 4, 5, 6\}$. The collection


$$A_1 = \{1, 2, 3\}, \quad A_2 = \{4, 5\}, \quad A_3 = \{5, 6\}$$


does not form a partition, because $A_2 \cap A_3 = \{5\}$.

If $\{A_1, A_2, \ldots, A_n\}$ forms a partition of the universal set $\Omega$, then for any $B \subseteq \Omega$, we
can decompose $B$ into $n$ disjoint subsets: $B \cap A_1$, $B \cap A_2$, ..., $B \cap A_n$. Two properties hold:
• $B \cap A_i$ and $B \cap A_j$ are disjoint if $i \ne j$.
• The union of $B \cap A_1$, $B \cap A_2$, ..., $B \cap A_n$ is $B$.


Practice Exercise 2.7. Prove the above two statements.


Solution. To prove the first statement, we can pick $\xi \in (B \cap A_i)$. This means that
$\xi \in B$ and $\xi \in A_i$. Since $\xi \in A_i$, it cannot be in $A_j$ because $A_i$ and $A_j$ are disjoint.
Therefore $\xi$ cannot live in $B \cap A_j$. This completes the proof, because we just showed
that any $\xi \in B \cap A_i$ cannot simultaneously live in $B \cap A_j$.


To prove the second statement, we pick $\xi \in \bigcup_{i=1}^{n} (B \cap A_i)$. Since $\xi$ lives in the
union, it has to live in at least one of the $(B \cap A_i)$ for some $i$. Now suppose $\xi \in B \cap A_i$.
This means that $\xi$ is in both $B$ and $A_i$, so it must live in $B$. Therefore, $\bigcup_{i=1}^{n} (B \cap A_i) \subseteq
B$. Now, suppose we pick $\xi \in B$. Since $B \subseteq \Omega$ and the union of the $A_i$'s is $\Omega$, $\xi$ must be an
element of at least one $A_i$, and hence of $(B \cap A_i)$ for that $i$. Therefore, $\xi \in \bigcup_{i=1}^{n} (B \cap A_i)$, and so we showed that
$B \subseteq \bigcup_{i=1}^{n} (B \cap A_i)$. Combining the two directions, we conclude that $\bigcup_{i=1}^{n} (B \cap A_i) = B$.


Example 2.18. Let $\Omega = \{1, 2, 3, 4, 5, 6\}$ and let a partition of $\Omega$ be $A_1 = \{1, 2, 3\}$,
$A_2 = \{4, 5\}$, $A_3 = \{6\}$. Let $B = \{1, 3, 4\}$. Then, by the result we just proved, $B$ can
be decomposed into three subsets:


$$B \cap A_1 = \{1, 3\}, \quad B \cap A_2 = \{4\}, \quad B \cap A_3 = \emptyset.$$


Thus we can see that $B \cap A_1$, $B \cap A_2$ and $B \cap A_3$ are disjoint. Furthermore, the union
of these three sets gives $B$.
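
The two partition conditions and the decomposition of $B$ can be expressed in a few lines of Python. This is a minimal sketch for finite sets; the function names `is_partition` and `decompose` are hypothetical helpers introduced only for illustration.

```python
from itertools import combinations


def is_partition(parts, omega):
    """Check the two partition conditions: pairwise disjoint, and union equal to omega."""
    pairwise_disjoint = all(P.isdisjoint(Q) for P, Q in combinations(parts, 2))
    covers_omega = set().union(*parts) == omega
    return pairwise_disjoint and covers_omega


def decompose(B, parts):
    """Split B into the pieces B ∩ A_i, one piece per part."""
    return [B & P for P in parts]


omega = {1, 2, 3, 4, 5, 6}
partition = [{1, 2, 3}, {4, 5}, {6}]        # the collection of Example 2.16
not_partition = [{1, 2, 3}, {4, 5}, {5, 6}]  # the collection of Example 2.17
print(is_partition(partition, omega), is_partition(not_partition, omega))

B = {1, 3, 4}
pieces = decompose(B, partition)
print(pieces)
print(set().union(*pieces) == B)
```

Some pieces may be empty, as with $B \cap A_3$ here; that does not affect disjointness or the union.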


2.1.9 Set operations 


When handling multiple sets, it would be useful to have some basic set operations. There 
are four basic theorems concerning set operations that you need to know for our purposes 
in this book: 


Theorem 2.2 (Commutative). (Order does not matter)


$$A \cap B = B \cap A, \quad \text{and} \quad A \cup B = B \cup A.$$


Theorem 2.3 (Associative). (How to do multiple union and intersection)


$$A \cup (B \cup C) = (A \cup B) \cup C,$$
$$A \cap (B \cap C) = (A \cap B) \cap C.$$

Theorem 2.4 (Distributive). (How to mix union and intersection)


$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C),$$
$$A \cup (B \cap C) = (A \cup B) \cap (A \cup C).$$


Theorem 2.5 (De Morgan’s Law). (How to complement over intersection and union)


$$(A \cap B)^c = A^c \cup B^c,$$
$$(A \cup B)^c = A^c \cap B^c. \tag{2.19}$$


Example 2.19. Consider $[1, 4] \cap ([0, 2] \cup [3, 5])$. By the distributive property we can
simplify the set as


$$[1, 4] \cap ([0, 2] \cup [3, 5]) = ([1, 4] \cap [0, 2]) \cup ([1, 4] \cap [3, 5]) = [1, 2] \cup [3, 4].$$


Example 2.20. Consider $([0, 1] \cup [2, 3])^c$. By De Morgan’s Law we can rewrite the set
as
$$([0, 1] \cup [2, 3])^c = [0, 1]^c \cap [2, 3]^c.$$
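
The four theorems can be spot-checked on randomly generated finite sets. The Python sketch below is such a check, not a proof. The universe `omega`, the helper names, and the random seed are arbitrary choices for illustration, and finite sets of integers stand in for the continuous intervals of the examples above.

```python
import random


def check_laws(A, B, C, omega):
    """Return True if the commutative, associative, distributive, and
    De Morgan identities all hold for the given subsets of omega."""
    def comp(S):
        return omega - S

    return all([
        (A & B) == (B & A),
        (A | B) == (B | A),
        (A | (B | C)) == ((A | B) | C),
        (A & (B & C)) == ((A & B) & C),
        (A & (B | C)) == ((A & B) | (A & C)),
        (A | (B & C)) == ((A | B) & (A | C)),
        comp(A & B) == (comp(A) | comp(B)),
        comp(A | B) == (comp(A) & comp(B)),
    ])


def random_subset(universe):
    """Include each element of the universe independently with probability 1/2."""
    return {x for x in universe if random.random() < 0.5}


random.seed(1)
omega = set(range(12))
all_hold = all(
    check_laws(random_subset(omega), random_subset(omega), random_subset(omega), omega)
    for _ in range(500)
)
print(all_hold)
```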


2.1.10 Closing remarks about set theory 


It should be apparent why set theory is useful: it shows us how to combine, split, and 
remove sets. In Figure 2.11 we depict the intersection of two sets A = {even number} and 
B = {less than or equal to 3}. Set theory tells us how to define the intersection so that the 
probability can be applied to the resulting set. 




Figure 2.11: When there are two events $A$ and $B$, the probability of $A \cap B$ is determined by first taking
the intersection of the two sets and then evaluating its probability.


Universal sets and empty sets are useful too. Universal sets cover all the possible
outcomes of an experiment, so we should expect $\mathbb{P}[\Omega] = 1$. Empty sets contain nothing,
and so we should expect $\mathbb{P}[\emptyset] = 0$. These two properties are essential to define a probability
because no probability can be greater than 1, and no probability can be less than 0.


2.2 Probability Space


We now formally define probability. Our discussion will be based on the slogan probability 
is a measure of the size of a set. Three elements constitute a probability space: 


• Sample Space $\Omega$: The set of all possible outcomes from an experiment.


• Event Space $\mathcal{F}$: The collection of all possible events. An event $E$ is a subset in $\Omega$ that
defines an outcome or a combination of outcomes.


• Probability Law $\mathbb{P}$: A mapping from an event $E$ to a number $\mathbb{P}[E]$ which, ideally,
measures the size of the event.


Therefore, whenever you talk about “probability,” you need to specify the triplet $(\Omega, \mathcal{F}, \mathbb{P})$
to define the probability space.

The necessity of the three elements is illustrated in Figure 2.12. The sample space 
is the interface with the physical world. It is the collection of all possible states that can 
result from an experiment. Some outcomes are more likely to happen, and some are less 
likely, but this does not matter because the sample space contains every possible outcome. 
The probability law is the interface with the data analysis. It is this law that defines the 
likelihood of each of the outcomes. However, since the probability law measures the size of 
a set, the probability law itself must be a function, a function whose argument is a set and 
whose value is a number. An outcome in the sample space is not a set. Instead, a subset in 
the sample space is a set. Therefore, the probability should input a subset and map it to a 
number. The collection of all possible subsets is the event space. 


[Figure 2.12 diagram labels: Physical World, Probability Model, Data Analysis, Sample Space]


Figure 2.12: Given an experiment, we define the collection of all outcomes as the sample space. A 
subset in the sample space is called an event. The probability law is a mapping that maps an event to 
a number that denotes the size of the event. 


A perceptive reader like you may be wondering why we want to complicate things to
this degree when calculating probability is trivial, e.g., throwing a die gives us a probability
$\frac{1}{6}$ per face. In a simple world where problems are that easy, you can surely ignore all these
complications and proceed to the answer $\frac{1}{6}$. However, modern data analysis is not so easy.
If we are given an image of size 64 × 64 pixels, how do we tell whether this image is of a cat
or a dog? We need to construct a probability model that tells us the likelihood of having a
particular 64 × 64 image. What should be included in this probability model? We need to
know all the possible cases (the sample space), all the possible events (the event space),
and the probability of each of the events (the probability law). If we know all these, then our
decision will be theoretically optimal. Of course, for high-dimensional data like images, we
need approximations to such a probability model. However, we first need to understand the
theoretical foundation of the probability space to know what approximations would make
sense.


2.2.1 Sample space $\Omega$


We start by defining the sample space $\Omega$. Given an experiment, the sample space $\Omega$ is the
set containing all possible outcomes of the experiment.


Definition 2.13. A sample space $\Omega$ is the set of all possible outcomes from an experiment.
We denote $\xi$ as an element in $\Omega$.


A sample space can contain discrete outcomes or continuous outcomes, as shown in 
the examples below and Figure 2.13. 


Example 2.21: (Discrete Outcomes)
• Coin flip: $\Omega = \{H, T\}$.
• Throw a die: $\Omega = \{1,2,3,4,5,6\}$.
• Paper / scissor / stone: $\Omega = \{\text{paper}, \text{scissor}, \text{stone}\}$.
• Draw an even integer: $\Omega = \{2,4,6,8,\ldots\}$.


Example 2.22: (Continuous Outcomes)
• Waiting time for a bus in West Lafayette: $\Omega = \{t \mid 0 \le t \le 30 \text{ minutes}\}$.
• Phase angle of a voltage: $\Omega = \{\theta \mid 0 \le \theta \le 2\pi\}$.
• Frequency of a pitch: $\Omega = \{f \mid 0 \le f \le f_{\max}\}$.


Figure 2.13 also shows a functional example of the sample space. In this case, the 
sample space contains functions. For example, 


• Set of all straight lines in 2D:


$\Omega = \{f \mid f(x) = ax + b, \; a, b \in \mathbb{R}\}$.


• Set of all cosine functions with a phase offset:


$\Omega = \{f \mid f(t) = \cos(2\pi\omega_0 t + \theta), \; 0 \le \theta \le 2\pi\}$.


As we see from the above examples, the sample space is nothing but a universal set.
The elements inside the sample space are the outcomes of the experiment. If you change
the experiment, the possible outcomes will be different so that the sample space will be
different. For example, flipping a coin has different possible outcomes from throwing a die.


Figure 2.13: The sample space can take various forms: it can contain discrete numbers, or continuous
intervals, or even functions.


What if we want to describe a composite experiment where we flip a coin and throw a 
die? Here is the sample space: 


Example 2.23: If the experiment contains flipping a coin and throwing a die, then
the sample space is


$\Omega = \{(H,1), (H,2), (H,3), (H,4), (H,5), (H,6), (T,1), (T,2), (T,3), (T,4), (T,5), (T,6)\}$.


In this sample space, each element is a pair of outcomes.
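

A composite sample space like the one in Example 2.23 is a Cartesian product of the individual sample spaces. As a minimal sketch (the variable names are illustrative and not from the text), Python's `itertools.product` builds it directly:

```python
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Each outcome of the composite experiment is a (coin, die) pair.
omega = set(product(coin, die))

print(len(omega))
print(("H", 1) in omega, ("T", 6) in omega)
```

The size of the composite sample space is the product of the sizes of the two individual sample spaces, here 2 × 6.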


Practice Exercise 2.8. There are 8 processors on a computer. A computer job scheduler
chooses one processor randomly. What is the sample space? If the computer job
scheduler can choose two processors at once, what is the sample space then?


Solution. The sample space of the first case is $\Omega = \{1,2,3,4,5,6,7,8\}$. The sample
space of the second case is $\Omega = \{(1,2), (1,3), (1,4), \ldots, (7,8)\}$.
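

The second sample space can be enumerated explicitly. The sketch below assumes, consistent with the listing $(1,2), (1,3), \ldots, (7,8)$, that the two processors are distinct and that their order does not matter; under that assumption `itertools.combinations` generates exactly those pairs.

```python
from itertools import combinations

processors = range(1, 9)

# Case 1: one processor is chosen.
omega_one = set(processors)

# Case 2: two distinct processors are chosen, order ignored (an assumption).
omega_two = set(combinations(processors, 2))

print(len(omega_one), len(omega_two))
```

If order mattered, `itertools.permutations(processors, 2)` would be the appropriate generator instead, and the sample space would be twice as large.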


Practice Exercise 2.9. A cell phone tower has a circular average coverage area of
radius 10 km. We observe the source locations of calls received by the tower. What
is the sample space of all possible source locations?


Solution. Assume that the center of the tower is located at $(x_0, y_0)$. The sample space
is the set


$\Omega = \{(x,y) \mid \sqrt{(x - x_0)^2 + (y - y_0)^2} \le 10\}$.


Not every set can be a sample space. A sample space must be exhaustive and exclusive.
The term “exhaustive” means that the sample space has to cover all possible outcomes. If
there is one possible outcome that is left out, then the set is no longer a sample space. The
term “exclusive” means that the sample space contains unique elements so that there is no
repetition of elements.


Example 2.24. (Counterexamples)


The following two examples are NOT sample spaces.


• Throw a die: $\Omega = \{1,2,3\}$ is not a sample space because it is not exhaustive.
• Throw a die: $\Omega = \{1,1,2,3,4,5,6\}$ is not a sample space because it is not exclusive.


Therefore, a valid sample space must contain all possible outcomes, and each element
must be unique.


We summarize the concept of a sample space as follows. 


What is a sample space $\Omega$?
• A sample space $\Omega$ is the collection of all possible outcomes.


• The outcomes can be numbers, letters, vectors, or functions. The outcomes
can also be images, videos, EEG signals, speech recordings, etc.


• $\Omega$ must be exhaustive and exclusive.


2.2.2 Event space $\mathcal{F}$


The sample space contains all the possible outcomes. However, in many practical situations, 
we are not interested in each of the individual outcomes; we are interested in the
combinations of the outcomes. For example, when throwing a die, we may ask “What is the
probability of rolling an odd number?” or “What is the probability of rolling a number that
is less than 3?” Clearly, “odd number” is not an outcome of the experiment because the
possible outcomes are $\{1,2,3,4,5,6\}$. We call “odd number” an event. An event must be a
subset in the sample space.


Definition 2.14. An event $E$ is a subset in the sample space $\Omega$. The set of all possible
events is denoted as $\mathcal{F}$.


While this definition is extremely simple, we need to keep in mind a few facts about events.
First, an outcome $\xi$ is an element in $\Omega$ but an event $E$ is a subset contained in $\Omega$, i.e.,
$E \subseteq \Omega$. Thus, an event can contain one outcome but it can also contain many outcomes.
The following example shows a few cases of events:


Example 2.25. Throw a die. Let $\Omega = \{1,2,3,4,5,6\}$. The following are two possible
events, as illustrated in Figure 2.14.


• $E_1 = \{\text{even numbers}\} = \{2,4,6\}$.
• $E_2 = \{\text{less than 3}\} = \{1,2\}$.


Figure 2.14: Two examples of events: The first event contains numbers $\{2,4,6\}$, and the second
event contains numbers $\{1,2\}$.


Practice Exercise 2.10. The “ping” command is used to measure round-trip times 
for Internet packets. What is the sample space of all possible round-trip times? What 
is the event that a round-trip time is between 10 ms and 20 ms? 


Solution. The sample space is $\Omega = [0, \infty)$. The event is $E = [10, 20]$.


Practice Exercise 2.11. A cell phone tower has a circular average coverage area of 
radius 10 km. We observe the source locations of calls received by the tower. What is 
the event when the source location of a call is between 2 km and 5 km from the tower? 


Solution. Assume that the center of the tower is located at $(x_0, y_0)$. The event is
$E = \{(x,y) \mid 2 \le \sqrt{(x - x_0)^2 + (y - y_0)^2} \le 5\}$.
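

Continuous sample spaces and events such as those in Practice Exercises 2.9 and 2.11 cannot be enumerated, but they can be represented by membership tests. The following sketch assumes the tower sits at the origin and that distances are in kilometers; the function names are illustrative only.

```python
import math


def distance(point, center=(0.0, 0.0)):
    """Euclidean distance from a source location to the tower."""
    return math.hypot(point[0] - center[0], point[1] - center[1])


def in_sample_space(point, center=(0.0, 0.0), radius=10.0):
    """True if the location lies inside the 10 km coverage disk."""
    return distance(point, center) <= radius


def in_event(point, center=(0.0, 0.0), inner=2.0, outer=5.0):
    """True if the location is between 2 km and 5 km from the tower."""
    return inner <= distance(point, center) <= outer


print(in_sample_space((6.0, 8.0)), in_event((3.0, 0.0)), in_event((1.0, 0.0)))
```

Every point that satisfies `in_event` also satisfies `in_sample_space`, which is the code-level statement that the event is a subset of the sample space.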


The second point we should remember is the cardinality of $\Omega$ and that of $\mathcal{F}$. A sample
space containing $n$ elements has a cardinality $n$. However, the event space constructed from
$\Omega$ will contain $2^n$ events. To see why this is so, let’s consider the following example.


Example 2.26. Consider an experiment with 3 outcomes $\Omega = \{\clubsuit, \heartsuit, \maltese\}$. We can list
out all the possible events: $\emptyset$, $\{\clubsuit\}$, $\{\heartsuit\}$, $\{\maltese\}$, $\{\clubsuit, \heartsuit\}$, $\{\clubsuit, \maltese\}$,
$\{\heartsuit, \maltese\}$, $\{\clubsuit, \heartsuit, \maltese\}$. So in total there are $2^3 = 8$ possible events. Figure 2.15
depicts the situation. What is the difference between $\clubsuit$ and $\{\clubsuit\}$? The former is an
element, whereas the latter is a set. Thus, $\{\clubsuit\}$ is an event but $\clubsuit$ is not an event. Why
is $\emptyset$ an event? Because we can ask “What is the probability that we get an odd number and
an even number?” The probability is obviously zero, but the reason it is zero is that the
event is an empty set.


Figure 2.15: The event space contains all the possible subsets inside the sample space.


In general, if there are $n$ elements in the sample space, then the number of events
is $2^n$. To see why this is true, we can assign to each element a binary value: either 0 or 1.
For example, in Table 2.1 we consider throwing a die. For each of the six faces, we assign a
binary code. This will give us a binary string for each event. For example, the event $\{1,5\}$
is encoded as the binary string 100010 because only 1 and 5 are activated. We can count
the total number of unique strings, which is the number of strings that can be constructed
from $n$ bits. It is easily seen that this number is $2^n$.


Event           | 1  2  3  4  5  6 | Binary Code
∅               | ×  ×  ×  ×  ×  × | 000000
{1,5}           | ○  ×  ×  ×  ○  × | 100010
{3,4,5}         | ×  ×  ○  ○  ○  × | 001110
{2,3,4,5,6}     | ×  ○  ○  ○  ○  ○ | 011111
{1,2,3,4,5,6}   | ○  ○  ○  ○  ○  ○ | 111111


Table 2.1: An event space contains $2^n$ events, where $n$ is the number of elements in the sample space.
To see this, we encode each outcome with a binary code. The resulting binary string then forms a unique
index of the event. Counting the total number of events gives us the cardinality of the event space.
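

The binary encoding of Table 2.1 translates directly into code. The sketch below is illustrative (the function names `event_from_code` and `code_from_event` are not from the text): it enumerates every $n$-bit string for the die and converts each one into the event it indexes.

```python
def event_from_code(code, outcomes):
    """Decode a binary string into the event it indexes."""
    return {x for x, bit in zip(outcomes, code) if bit == "1"}


def code_from_event(event, outcomes):
    """Encode an event as a binary string, one bit per outcome."""
    return "".join("1" if x in event else "0" for x in outcomes)


outcomes = [1, 2, 3, 4, 5, 6]
n = len(outcomes)

# Every n-bit string, zero-padded, corresponds to exactly one event.
codes = [format(k, f"0{n}b") for k in range(2 ** n)]
event_space = [event_from_code(c, outcomes) for c in codes]

print(len(event_space))
print(code_from_event({1, 5}, outcomes))
print(event_from_code("001110", outcomes))
```

Because the encoding and decoding are inverse to each other, counting the events is the same as counting the binary strings.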


The box below summarizes what you need to know about event spaces. 


What is an event space $\mathcal{F}$?


• An event space $\mathcal{F}$ is the set of all possible subsets. It is a set of sets.


• We need $\mathcal{F}$ because the probability law $\mathbb{P}$ is mapping a set to a number. $\mathbb{P}$ does
not take an outcome from $\Omega$ but a subset inside $\Omega$.


Event spaces: Some advanced topics


The following discussions can be skipped if it is your first time reading the book. 


What else do we need to take care of in order to ensure that an event is well defined?
A few set operations seem to be necessary. For example, if $E_1 = \{1\}$ and $E_2 = \{2\}$ are
events, it is necessary that $E = E_1 \cup E_2 = \{1,2\}$ is an event too. Another example: if
$E_1 = \{5,6\}$ and $E_2 = \{1,5\}$ are events, then it is necessary that $E = E_1 \cap E_2 = \{5\}$ is also
an event. The third example: if $E_1 = \{3,4,5,6\}$ is an event, then $E = E_1^c = \{1,2\}$ should
be an event. As you can see, there is nothing sophisticated in these examples. They are just
some basic set operations. We want to ensure that the event space is closed under these
set operations. That is, we do not want to be surprised by finding that a set constructed
from two events is not an event. However, since all set operations can be constructed from
union, intersection and complement, ensuring that the event space is closed under these
three operations effectively ensures that it is closed under all set operations.

The formal way to guarantee these is the notion of a field. This term may seem to be 
abstract, but it is indeed quite useful: 


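Definition 2.15. For an event space $\mathcal{F}$ to be valid, $\mathcal{F}$ must be a field. It is a field
if it satisfies the following conditions:


• $\emptyset \in \mathcal{F}$ and $\Omega \in \mathcal{F}$.


• (Closed under complement) If $F \in \mathcal{F}$, then also $F^c \in \mathcal{F}$.


• (Closed under union and intersection) If $F_1 \in \mathcal{F}$ and $F_2 \in \mathcal{F}$, then $F_1 \cap F_2 \in \mathcal{F}$
and $F_1 \cup F_2 \in \mathcal{F}$.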


For a finite set, i.e., a set that contains $n$ elements, the collection of all possible subsets
is indeed a field. This is not difficult to see if you consider rolling a die. For example, if
$E = \{3,4,5,6\}$ is inside $\mathcal{F}$, then $E^c = \{1,2\}$ is also inside $\mathcal{F}$. This is because $\mathcal{F}$ consists of $2^n$
subsets each being encoded by a unique binary string. So if $E = 001111$, then $E^c = 110000$,
which is also in $\mathcal{F}$. Similar reasoning applies to intersection and union.
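
For a finite sample space, the field conditions can be verified exhaustively. The following sketch is an illustration under the assumption that events are stored as `frozenset` objects; `power_set` and `is_field` are hypothetical helper names introduced here.

```python
from itertools import combinations


def power_set(omega):
    """All subsets of a finite sample space, as frozensets."""
    items = sorted(omega)
    return {
        frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)
    }


def is_field(events, omega):
    """Check the three field conditions on a finite collection of events."""
    omega = frozenset(omega)
    events = {frozenset(e) for e in events}
    # The empty set and the sample space must be events.
    if frozenset() not in events or omega not in events:
        return False
    # Closed under complement.
    if any(omega - e not in events for e in events):
        return False
    # Closed under (pairwise) union and intersection.
    return all(
        (a | b) in events and (a & b) in events for a in events for b in events
    )


omega = {1, 2, 3, 4, 5, 6}
F = power_set(omega)
print(len(F), is_field(F, omega))

# A collection that omits the complement {1, 2} of {3, 4, 5, 6} is not a field.
not_a_field = [set(), {3, 4, 5, 6}, omega]
print(is_field(not_a_field, omega))
```

The power set is the largest field on $\Omega$, but smaller collections can also be fields; for instance $\{\emptyset, \{1,2\}, \{3,4,5,6\}, \Omega\}$ passes the same check.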

At this point, you may ask: 


• Why bother constructing a field? The answer is that probability is a measure of the
size of a set, so we must input a set to a probability measure $\mathbb{P}$ to get a number. The
set being input to $\mathbb{P}$ must be a subset inside the sample space; otherwise, it will be
undefined. If we regard $\mathbb{P}$ as a mapping, we need to specify the collection of all its
inputs, which is the set of all subsets, i.e., the event space. So if we do not define the
field, there is no way to define the measure $\mathbb{P}$.


• What if the event space is not a field? If the event space is not a field, then we can
easily construct pathological cases where we cannot assign a probability. For example,
if the event space is not a field, then it would be possible that the complement of
$E = \{3,4,5,6\}$ (which is $E^c = \{1,2\}$) is not an event. This just does not make sense.


The concept of a field is sufficient for finite sample spaces. However, there are two
other types of sample spaces where the concept of a field is inadequate. The first type of
sets consists of the countably infinite sets, and the second type consists of the sets defined
on the real line. There are other types of sets, but these two have important practical
applications. Therefore, we need to have a basic understanding of these two types.


Sigma-field 


The difficulty of a countably infinite set is that there are infinitely many subsets in the field
of a countably infinite set. Having a finite union and a finite intersection is insufficient to
ensure the closedness of all intersections and unions. In particular, having $F_1 \cup F_2 \in \mathcal{F}$ does
not automatically give us $\bigcup_{n=1}^{\infty} F_n \in \mathcal{F}$ because the latter is an infinite union. Therefore,
for countably infinite sets, the requirements on the event space are more restrictive, as we need
to ensure closure under infinite intersections and unions. The resulting field is called the $\sigma$-field.


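Definition 2.16. A sigma-field ($\sigma$-field) $\mathcal{F}$ is a collection of subsets such that
• $\mathcal{F}$ is a field, and


• if $F_1, F_2, \ldots \in \mathcal{F}$, then the union $\bigcup_{i=1}^{\infty} F_i$ and the intersection $\bigcap_{i=1}^{\infty} F_i$ are both
in $\mathcal{F}$.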


When do we need a $\sigma$-field? When the sample space is countable and has infinitely
many elements. For example, if the sample space contains all integers, then the collection
of all possible subsets is a $\sigma$-field. For another, if $E_1 = \{2\}$, $E_2 = \{4\}$, $E_3 = \{6\}$, ..., then
$\bigcup_{n=1}^{\infty} E_n = \{2,4,6,8,\ldots\} = \{\text{positive even numbers}\}$. Clearly, we want
$\bigcup_{n=1}^{\infty} E_n$ to live in the event space.


Borel sigma-field 


While a sigma-field allows us to consider countable sets of events, it is still insufficient for 
considering events defined on the real line, e.g., time, as these events are not countable. 
So how do we define an event on the real line? It turns out that we need a different way 
to define the smallest unit. For finite sets and countable sets, the smallest units are the 
elements themselves because we can count them. For the real line, we cannot count the 
elements because any non-empty interval is uncountably infinite. 

The smallest unit we use to construct a field for the real line is a semi-closed interval


$(-\infty, b] \triangleq \{x \mid -\infty < x \le b\}$.


The Borel $\sigma$-field is defined as the sigma-field generated by the semi-closed intervals.


Definition 2.17. The Borel $\sigma$-field $\mathcal{B}$ is a $\sigma$-field generated from semi-closed intervals:


$(-\infty, b] \triangleq \{x \mid -\infty < x \le b\}$.


The difference between the Borel $\sigma$-field $\mathcal{B}$ and a regular $\sigma$-field is how we measure the
subsets. In a $\sigma$-field, we count the elements in the subsets, whereas, in a Borel $\sigma$-field, we
use the semi-closed intervals to measure the subsets.

Being a field, the Borel $\sigma$-field is closed under complement, union, and intersection. In
particular, subsets of the following forms are also in the Borel $\sigma$-field $\mathcal{B}$:
$(a,b)$, $[a,b]$, $(a,b]$, $[a,b)$, $[a,\infty)$, $(a,\infty)$, $(-\infty,b]$, $\{b\}$.


For example, $(a,\infty)$ can be constructed from $(-\infty,a]^c$, and $(a,b]$ can be constructed by
taking the intersection of $(-\infty,b]$ and $(a,\infty)$.


Example 2.27: Waiting for a bus. Let $\Omega = \{0 \le t \le 30\}$. The Borel $\sigma$-field contains
all semi-closed intervals $(a,b]$, where $0 \le a < b \le 30$. Here are two possible events:


• $E_1 = \{\text{less than 10 minutes}\} = \{0 \le t < 10\} = \{0\} \cup (\{0 < t \le 10\} \cap \{10\}^c)$.
• $E_2 = \{\text{more than 20 minutes}\} = \{20 < t \le 30\}$.


Further discussion of the Borel $\sigma$-field can be found in Leon-Garcia (3rd Edition),
Chapter 2.9.


This is the end of the discussion. Please join us again. 


2.2.3 Probability law P 


The third component of a probability space is the probability law P. Its job is to assign a 
number to an event. 


Definition 2.18. A probability law is a function $\mathbb{P} : \mathcal{F} \to [0,1]$ that maps an event $E$ to a
real number in $[0,1]$.


The probability law is thus a function, and therefore we must specify the input and
the output. The input to $\mathbb{P}$ is an event $E$, which is a subset in $\Omega$ and an element in $\mathcal{F}$. The
output of $\mathbb{P}$ is a number between 0 and 1, which we call the probability.

The definition above does not specify how an event is being mapped to a number.
However, since probability is a measure of the size of a set, a meaningful $\mathbb{P}$ should be
consistent for all events in $\mathcal{F}$. This requires some rules, known as the axioms of probability,
when we define $\mathbb{P}$. Any probability law $\mathbb{P}$ must satisfy these axioms; otherwise, we will
see contradictions. We will discuss the axioms in the next section. For now, let us look at
two examples to make sure we understand the functional nature of $\mathbb{P}$.


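Example 2.28. Consider flipping a coin. The event space is $\mathcal{F} = \{\emptyset, \{H\}, \{T\}, \Omega\}$.
We can define the probability law as


$\mathbb{P}[\emptyset] = 0$, $\mathbb{P}[\{H\}] = \frac{1}{2}$, $\mathbb{P}[\{T\}] = \frac{1}{2}$, $\mathbb{P}[\Omega] = 1$,


as shown in Figure 2.16. This $\mathbb{P}$ is clearly consistent for all the events in $\mathcal{F}$.
Is it possible to construct an invalid $\mathbb{P}$? Certainly. Consider the following probability law:


$\mathbb{P}[\emptyset] = 0$, $\mathbb{P}[\{H\}] = \frac{1}{3}$, $\mathbb{P}[\{T\}] = \frac{1}{3}$, $\mathbb{P}[\Omega] = 1$.


This law is invalid because the individual events are $\mathbb{P}[\{H\}] = \frac{1}{3}$ and $\mathbb{P}[\{T\}] = \frac{1}{3}$,
which sum to only $\frac{2}{3}$, but the union is $\mathbb{P}[\Omega] = 1$. To fix this problem, one possible solution
is to define the probability law as


$\mathbb{P}[\emptyset] = 0$, $\mathbb{P}[\{H\}] = \frac{1}{3}$, $\mathbb{P}[\{T\}] = \frac{2}{3}$, $\mathbb{P}[\Omega] = 1$.


Then, the probabilities for all the events are well defined and consistent.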
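

The consistency requirement in Example 2.28 can be tested mechanically. The sketch below is an illustration, not the formal axioms (which are discussed in the next section): the hypothetical helper `is_consistent` checks only that the empty set has probability 0, the sample space has probability 1, and disjoint events add. The repaired law uses $\frac{1}{3}$ and $\frac{2}{3}$ as one possible choice, which is an assumption made for this example.

```python
from fractions import Fraction
from itertools import combinations

EMPTY = frozenset()
H = frozenset({"H"})
T = frozenset({"T"})
OMEGA = frozenset({"H", "T"})


def is_consistent(law, omega):
    """Check P[empty] = 0, P[omega] = 1, and additivity over disjoint events."""
    if law[frozenset()] != 0 or law[frozenset(omega)] != 1:
        return False
    for a, b in combinations(law, 2):
        if a.isdisjoint(b) and (a | b) in law and law[a | b] != law[a] + law[b]:
            return False
    return True


fair = {EMPTY: Fraction(0), H: Fraction(1, 2), T: Fraction(1, 2), OMEGA: Fraction(1)}
invalid = {EMPTY: Fraction(0), H: Fraction(1, 3), T: Fraction(1, 3), OMEGA: Fraction(1)}
repaired = {EMPTY: Fraction(0), H: Fraction(1, 3), T: Fraction(2, 3), OMEGA: Fraction(1)}

print(is_consistent(fair, OMEGA))
print(is_consistent(invalid, OMEGA))
print(is_consistent(repaired, OMEGA))
```

Using `Fraction` instead of floating-point numbers keeps the additivity comparison exact.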




Figure 2.16: A probability law is a mapping from an event to a number. A probability law cannot be 
arbitrarily assigned; it must satisfy the axioms of probability. 


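Example 2.29. Consider a sample space containing three elements $\Omega = \{\clubsuit, \heartsuit, \maltese\}$.
The event space is then $\mathcal{F} = \{\emptyset, \{\clubsuit\}, \{\heartsuit\}, \{\maltese\}, \{\clubsuit,\heartsuit\}, \{\clubsuit,\maltese\}, \{\heartsuit,\maltese\}, \{\clubsuit,\heartsuit,\maltese\}\}$.


One possible $\mathbb{P}$ we could define would be


$\mathbb{P}[\emptyset] = 0$, $\mathbb{P}[\{\clubsuit\}] = \mathbb{P}[\{\heartsuit\}] = \mathbb{P}[\{\maltese\}] = \frac{1}{3}$,


$\mathbb{P}[\{\clubsuit,\heartsuit\}] = \mathbb{P}[\{\clubsuit,\maltese\}] = \mathbb{P}[\{\heartsuit,\maltese\}] = \frac{2}{3}$, $\mathbb{P}[\{\clubsuit,\heartsuit,\maltese\}] = 1$.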


What is a probability law $\mathbb{P}$?
• A probability law $\mathbb{P}$ is a function.


• It takes a subset (an element in $\mathcal{F}$) and maps it to a number between 0 and 1.


• $\mathbb{P}$ is a measure of the size of a set.


• For $\mathbb{P}$ to be valid, it must satisfy the axioms of probability.


[Figure 2.17 panel labels: $\mathbb{P}[E]$ = count; $\mathbb{P}[E]$ = length; $\mathbb{P}[E]$ = area]


Figure 2.17: Probability is a measure of the size of a set. The probability can be a counter that counts 
the number of elements, a ruler that measures the length of an interval, or an integration that measures 
the area of a region. 


A probability law P is a measure 


Consider the word “measure” in our slogan: probability is a measure of the size of a set. 
Depending on the nature of the set, the measure can be a counter, ruler, scale, or even a 
stopwatch. So far, all the examples we have seen are based on sets with a finite number of 
elements. For these sets, the natural choice of the probability measure is a counter. However, 
if the sets are intervals on the real line or regions in a plane, we need a different probability 
law to measure their size. Let’s look at the examples shown in Figure 2.17. 


Example 2.30 (Finite Set). Consider throwing a die, so that
$\Omega = \{1,2,3,4,5,6\}$.


Then the probability measure is a counter that reports the number of elements in an event
(normalized by the number of elements in $\Omega$). If the die is fair, i.e., all the 6 faces have
equal probability of happening, then an event $E = \{1,3\}$ will have a probability $\mathbb{P}[E] = \frac{2}{6}$.


Example 2.31 (Intervals). Suppose that the sample space is a unit interval $\Omega = [0,1]$.
Let $E$ be an event such that $E = [a,b]$ where $a \le b$ are numbers in $[0,1]$. Then the
probability measure is a ruler that measures the length of the intervals. If all the
numbers in $[0,1]$ have equal probability of appearing, then $\mathbb{P}[E] = b - a$.


Example 2.32 (Regions). Suppose that the sample space is the square $\Omega = [-1,1] \times [-1,1]$.
Let $E$ be a disk such that $E = \{(x,y) \mid x^2 + y^2 < r^2\}$, where $r < 1$. Then the
probability measure is an area measure that returns the area of $E$ relative to the area of $\Omega$.
If we assume that all coordinates in $\Omega$ are equally probable, then $\mathbb{P}[E] = \frac{\pi r^2}{4}$, for $r < 1$,
because $\Omega$ has area 4.
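

The three measures in Examples 2.30 to 2.32 (counter, ruler, area) can each be written as a short function. The sketch below assumes equally likely outcomes in every case and normalizes by the size of the sample space, so that the whole sample space has probability 1; the function names are illustrative.

```python
import math
from fractions import Fraction


def prob_count(event, omega):
    """Counter: number of outcomes in the event over the number in omega."""
    return Fraction(len(event & omega), len(omega))


def prob_length(a, b, lo=0.0, hi=1.0):
    """Ruler: length of [a, b] relative to the length of omega = [lo, hi]."""
    return (b - a) / (hi - lo)


def prob_disk(r, side=2.0):
    """Area: area of a disk of radius r relative to a square of the given side."""
    return math.pi * r ** 2 / side ** 2


print(prob_count({1, 3}, {1, 2, 3, 4, 5, 6}))
print(prob_length(0.2, 0.5))
print(prob_disk(0.5))
```

The normalization matters: for the square $[-1,1] \times [-1,1]$ the area of the sample space is 4, so the disk's probability is its area divided by 4 and never exceeds 1 for $r \le 1$.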


Because probability is a measure of the size of a set, two sets can be compared according
to their probability measures. For example, if $\Omega = \{\clubsuit, \heartsuit, \maltese\}$, and if $E_1 = \{\clubsuit\}$ and
$E_2 = \{\clubsuit, \heartsuit\}$, then one possible $\mathbb{P}$ is to assign $\mathbb{P}[E_1] = \mathbb{P}[\{\clubsuit\}] = \frac{1}{3}$ and
$\mathbb{P}[E_2] = \mathbb{P}[\{\clubsuit, \heartsuit\}] = \frac{2}{3}$. In this particular case, we see that $E_1 \subseteq E_2$ and thus
$\mathbb{P}[E_1] \le \mathbb{P}[E_2]$.


Let’s now consider the term “size.” Notice that the concept of the size of a set is not
limited to the number of elements. A better way to think about size is to imagine that it is
the weight of the set. This may seem fanciful at first, but it is quite natural. Consider
the following example.


Example 2.33. (Discrete events with different weights) Suppose we have a sample
space $\Omega = \{\clubsuit, \heartsuit, \maltese\}$. Let us assign a different probability to each outcome:


$\mathbb{P}[\{\clubsuit\}] = \frac{2}{6}$, $\mathbb{P}[\{\heartsuit\}] = \frac{1}{6}$, $\mathbb{P}[\{\maltese\}] = \frac{3}{6}$.


As illustrated in Figure 2.18, since each outcome has a different weight, when determining
the probability of a set of outcomes we can add these weights (instead of
counting the number of outcomes). For example, when reporting $\mathbb{P}[\{\clubsuit\}]$ we find its
weight $\mathbb{P}[\{\clubsuit\}] = \frac{2}{6}$, whereas when reporting $\mathbb{P}[\{\heartsuit, \maltese\}]$ we find the sum of their weights
$\mathbb{P}[\{\heartsuit, \maltese\}] = \frac{1}{6} + \frac{3}{6} = \frac{4}{6}$. Therefore, the notion of size does not refer to the number of
elements but to the total weight of these elements.
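

Summing weights instead of counting elements is a one-line change in code. The sketch below uses the weights $\frac{2}{6}$, $\frac{1}{6}$, and $\frac{3}{6}$, which are the values consistent with the sums quoted in Example 2.33; the string labels standing in for the three symbols are an assumption made for readability.

```python
from fractions import Fraction

# Weight of each outcome; the labels stand in for the three card-like symbols.
weights = {
    "club": Fraction(2, 6),
    "heart": Fraction(1, 6),
    "cross": Fraction(3, 6),
}


def prob(event):
    """Probability of an event = total weight of the outcomes it contains."""
    return sum((weights[x] for x in event), Fraction(0))


print(prob({"club"}))
print(prob({"heart", "cross"}))
print(prob(set(weights)))
```

Setting every weight to the same value recovers the counter of Example 2.30, so counting is the special case of equal weights.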


Figure 2.18: This example shows the “weights” of three elements in a set. The weights are numbers
between 0 and 1 such that the sum is 1. When applying a probability measure to this set, we sum the
weights for the elements in the events being considered. For example, $\mathbb{P}[\{\heartsuit, \maltese\}]$ = yellow + green, and
$\mathbb{P}[\{\clubsuit\}]$ = purple.


Example 2.34. (Continuous events with different weights) Suppose that the sample
space is an interval, say $\Omega = [-1,1]$. On this interval we define a weighting function
$f(x)$ where $f(x_0)$ specifies the weight for $x_0$. Because $\Omega$ is an interval, events defined
on this $\Omega$ must also be intervals. For example, we can consider two events $E_1 = [a,b]$
and $E_2 = [c,d]$. The probabilities of these events are $\mathbb{P}[E_1] = \int_a^b f(x)\,dx$ and
$\mathbb{P}[E_2] = \int_c^d f(x)\,dx$, as shown in Figure 2.19.
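

For a continuous sample space the sum over weights becomes an integral. The sketch below is a numerical illustration only: the weighting function $f(x) = \frac{3}{4}(1 - x^2)$ on $[-1,1]$ is a hypothetical choice (it is non-negative and integrates to 1), and the integral is approximated with a simple midpoint rule.

```python
def f(x):
    """Hypothetical weighting function on [-1, 1]; it integrates to 1."""
    return 0.75 * (1.0 - x * x)


def prob_interval(a, b, weight=f, steps=10_000):
    """Approximate the integral of the weight over [a, b] by the midpoint rule."""
    h = (b - a) / steps
    return sum(weight(a + (k + 0.5) * h) for k in range(steps)) * h


print(prob_interval(-1.0, 1.0))   # whole sample space
print(prob_interval(0.0, 0.5))    # an event E = [0, 0.5]
print(prob_interval(0.5, 0.5))    # a degenerate interval
```

For this particular $f$ the exact value on $[0, 0.5]$ is $\frac{3}{4}\left(\frac{1}{2} - \frac{1}{24}\right) = 0.34375$, which the midpoint rule reproduces to several decimal places.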


Viewing probability as a measure is not just a game for mathematicians; rather, it
has fundamental significance for several reasons. First, it eliminates any dependency on
probability as relative frequency from the frequentist point of view. Relative frequency is a
narrowly defined concept that is largely limited to discrete events, e.g., flipping a coin. While
we can assign weights to coin-toss events to deal with those biased coins, the extension to
continuous events becomes problematic. By thinking of probability as a measure, we can
generalize the notion to apply to intervals, areas, volumes, and so on.


Figure 2.19: If the sample space is an interval on the real line, then the probability of an event is the
area under the curve of the weighting function.

Second, viewing probability as a measure forces us to disentangle an event from measures.
An event is a subset in the sample space. It has nothing to do with the measure
(e.g., a ruler) you use to measure the event. The measure, on the other hand, specifies the
weighting function you apply to measure the event when computing the probability. For
example, let $\Omega = [-1,1]$ be an interval, and let $E = [a,b]$ be an event. We can define two
weighting functions $f(x)$ and $g(x)$. Correspondingly, we will have two different probability
measures $F$ and $G$ such that


$F([a,b]) = \int_E dF = \int_a^b f(x)\,dx$,
$G([a,b]) = \int_E dG = \int_a^b g(x)\,dx$. (2.20)


To make sense of these notations, consider only $\mathbb{P}[[a,b]]$ and not $F([a,b])$ and $G([a,b])$. As you
can see, the event for both measures is $E = [a,b]$ but the measures are different. Therefore,
the values of the probability are different.


Example 2.35. (Two probability laws are different if their weighting functions are 
different.) Consider two different weighting functions for throwing a die. The first one 
assigns probability as the following: 


PLC = a> PI =, PIB} = 
PI(4}] =<, PUSH = p> PCG} = 
whereas the second function assigns the probability like this: 
PI = 5, PI =, PB = 4, 
PI{4H = 5, PUSH = 5, PIO = 5. 




Let an event $E = \{1,2\}$. Let $F$ be the measure using the first set of probabilities, and
let $G$ be the measure of the second set of probabilities. Then,


2 3 


F(E) =F) = +5 =o 


G(B) =G(0,) = 4+ 4=5. 


Therefore, although the events are the same, the two different measures will give us 
two different probability values. 
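

The point of Example 2.35, one event measured by two different measures, can be reproduced with any two sets of weights. The weights below are illustrative assumptions chosen for this sketch (a non-uniform assignment for $F$ and a uniform one for $G$); they are not claimed to be the values used in the example.

```python
from fractions import Fraction

faces = range(1, 7)

# Two hypothetical weighting functions on the same sample space.
F_weights = dict(zip(faces, [Fraction(k, 12) for k in (1, 2, 3, 4, 1, 1)]))
G_weights = {face: Fraction(1, 6) for face in faces}


def measure(event, weights):
    """Total weight of an event under the given weighting function."""
    return sum((weights[x] for x in event), Fraction(0))


E = {1, 2}
print(measure(E, F_weights))
print(measure(E, G_weights))
```

The event `E` is the same object in both calls; only the second argument, the measure, changes, and with it the resulting probability.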


Remark. The notation $\int_E dF$ in Equation (2.20) is known as the Lebesgue integral. You
should be aware of this notation, but the theory of Lebesgue measure is beyond the scope
of this book.


2.2.4 Measure zero sets 


Understanding the measure perspective on probability allows us to understand another
important concept of probability, namely measure zero sets. To introduce this concept, we
pose the question: What is the probability of obtaining a single point, say $\{0.5\}$, when the
sample space is $\Omega = [0,1]$?

The answer to this question is rooted in the compatibility between the measure and
the sample space. In other words, the measure has to be meaningful for the events in the
sample space. Using $\Omega = [0,1]$, since $\Omega$ is an interval, an appropriate measure would be the
length of this interval. You may add different weighting functions to define your measure,
but ultimately, the measure must be an integral. If you use a “counter” as a measure, then
the counter and the interval are not compatible because you cannot count on the real line.

Now, suppose that we define a measure for $\Omega = [0,1]$ using a weighting function $f(x)$.
This measure is determined by an integration. Then, for $E = \{0.5\}$, the measure is

$\mathbb{P}[E] = \mathbb{P}[\{0.5\}] = \int_{0.5}^{0.5} f(x)\,dx = 0$.¹

In fact, for any weighting function the integral will be zero because the length of the set 
E is zero. An event that gives us zero probability is known as an event with measure 0. 
Figure 2.20 shows an example. 




Figure 2.20: The probability of obtaining a single point in a continuous interval is zero. 


¹We assume that $f$ is continuous throughout $[0,1]$. If $f$ is discontinuous at $x = 0.5$, some additional
considerations will apply.


What are measure zero sets?


• A set $E$ (non-empty) is called a measure zero set when $\mathbb{P}[E] = 0$.


• For example, $\{0\}$ is a measure zero set when we use a continuous measure $F$.


• But $\{0\}$ can have a positive measure when we use a discrete measure $G$.


Example 2.36(a). Consider a fair die with $\Omega = \{1,2,3,4,5,6\}$. Then the set $\{1\}$ has
a probability of $\frac{1}{6}$. The sample space does not have a measure zero event because the
measure we use is a counter.


Example 2.36(b). Consider an interval with $\Omega = [1,6]$. Then the set $\{1\}$ has measure
0 because it is an isolated point with respect to the sample space.


Example 2.36(c). For any interval $[a,b]$ under a continuous measure, $\mathbb{P}[[a,b]] = \mathbb{P}[(a,b)]$ because
the two end points have measure zero: $\mathbb{P}[\{a\}] = \mathbb{P}[\{b\}] = 0$.


Formal definitions of measure zero sets 


The following discussion of the formal definitions of measure zero sets is optional for the 
first reading of this book. 


We can formally define measure zero sets as follows:
Definition 2.19. Let $\Omega$ be the sample space. A set $A \subseteq \Omega$ is said to have measure
zero if for any given $\epsilon > 0$,


• There exists a countable number of subsets $A_n$ such that $A \subseteq \bigcup_{n=1}^{\infty} A_n$, and
• $\sum_{n=1}^{\infty} \mathbb{P}[A_n] < \epsilon$.


You may need to read this definition carefully. Suppose we have an event $A$. We construct
a set of neighbors $A_1, A_2, \ldots$ such that $A$ is included in the union $\bigcup_{n=1}^{\infty} A_n$. If the sum of
all the $\mathbb{P}[A_n]$ is still less than $\epsilon$, then the set $A$ will have a measure zero.

To understand the difference between a measure for a continuous set and a countable
set, consider Figure 2.21. On the left side of Figure 2.21 we show an interval $\Omega$ in which there
is an isolated point $x_0$. The measure for this $\Omega$ is the length of the interval (relative to whatever
weighting function you use). We define a small neighborhood $A_0 = (x_0 - \frac{\epsilon}{2}, x_0 + \frac{\epsilon}{2})$
surrounding $x_0$. The length of this interval is not more than $\epsilon$. We then shrink $\epsilon$. However,
regardless of how small $\epsilon$ is, since $x_0$ is an isolated point, it is always included in the
neighborhood. Therefore, the definition is satisfied, and so $\{x_0\}$ has measure zero.


Example 2.37. Let $\Omega = [0,1]$. The set $\{0.5\} \subset \Omega$ has measure zero, i.e., $\mathbb{P}[\{0.5\}] = 0$.
To see this, we draw a small interval around 0.5, say $[0.5 - \epsilon/3, 0.5 + \epsilon/3]$. Inside this
interval, there is really nothing to measure besides the point 0.5. Thus we have found
an interval such that it contains 0.5, and the probability is
$\mathbb{P}[[0.5 - \epsilon/3, 0.5 + \epsilon/3]] = 2\epsilon/3 < \epsilon$. Therefore, by definition, the set $\{0.5\}$ has measure 0.


The situation is very different for the right-hand side of Figure 2.21. Here, the measure
is not the length but a counter. So if we create a neighborhood surrounding the isolated
point $x_0$, we can always make a count. As a result, if you shrink $\epsilon$ to become a very small
number (in this case less than $\frac{1}{4}$), then $P[\{x_0\}] < \epsilon$ will no longer be true. Therefore, the
set $\{x_0\}$ has a non-zero measure when we use the counter as the measure.


Figure 2.21: [Left] For a continuous sample space, a single point event $\{x_0\}$ can always be surrounded
by a neighborhood $A_0$ whose size $P[A_0] < \epsilon$. [Right] If you change the sample space to discrete
elements, then a single point event $\{x_0\}$ can still be surrounded by a neighborhood $A_0$. However, the
size $P[A_0] = 1/4$ is a fixed number and will not work for any $\epsilon$.


When we make probabilistic claims without considering the measure zero sets, we say 
that an event happens almost surely. 


Definition 2.20. An event $A \subseteq \mathbb{R}$ is said to hold almost surely (a.s.) if

$$P[A] = 1$$

except for all measure zero sets in $\mathbb{R}$.


Therefore, if a set A contains measure zero subsets, we can simply ignore them because they 
do not affect the probability of events. In this book, we will omit “a.s.” if the context is 
clear. 


Example 2.38(a). Let $\Omega = [0,1]$. Then $P[(0,1)] = 1$ almost surely because the points
0 and 1 have measure zero in $\Omega$.

Example 2.38(b). Let $\Omega = \{x \mid \|x\|_2 \le 1\}$ and let $A = \{x \mid \|x\|_2 < 1\}$. Then $P[A] = 1$
almost surely because the circumference has measure zero in $\Omega$.


Practice Exercise 2.12. Let $\Omega = \{f : \mathbb{R} \to [-1,1] \mid f(t) = \cos(\omega_0 t + \theta)\}$, where $\omega_0$ is
a fixed constant and $\theta$ is random. Construct a measure zero event and an almost sure
event.


Solution. Let 


$$E = \{f : \mathbb{R} \to [-1,1] \mid f(t) = \cos(\omega_0 t + k\pi/2)\}$$

for any integer $k$. That is, $E$ contains all the functions with a phase of $\pi/2$, $2\pi/2$, $3\pi/2$,
etc. Then $E$ will have measure zero because it is a countable set of isolated functions.
The event $E^c$ will have probability $P[E^c] = 1$ almost surely because $E$ has measure
zero.


This is the end of the discussion. Please join us again. 


2.2.5 Summary of the probability space 


After the preceding long journey through theory, let us summarize. 

First, it is extremely important to understand our slogan: probability is a measure of 
the size of a set. This slogan is precise, but it needs clarification. When we say probability 
is a measure, we are thinking of it as being the probability law P. Of course, in practice, we 
always think of probability as the number returned by the measure. However, the difference 
is not crucial. Also, “size” not only means the number of elements in the set, but it also 
means the relative weight of the set in the sample space. For example, if we use a weight 
function to weigh the set elements, then size would refer to the overall weight of the set. 

When we put all these pieces together, we can understand why a probability space 
must consist of the three components 


$$(\Omega, \mathcal{F}, P), \tag{2.22}$$


where $\Omega$ is the sample space that defines all possible outcomes, $\mathcal{F}$ is the event space generated
from $\Omega$, and $P$ is the probability law that maps an event to a number in [0, 1]. Can we drop
one or more of the three components? We cannot! If we do not specify the sample space $\Omega$,
then there is no way to define the events. If we do not have a complete event space $\mathcal{F}$,
then some events will become undefined, and further, if the probability law is applied only 
to outcomes, we will not be able to define the probability for events. Finally, if we do not 
specify the probability law, then we do not have a way to assign probabilities. 


2.3 Axioms of Probability


We now turn to a deeper examination of the properties. Our motivation is simple. While 
the definition of probability law has achieved its goal of assigning a probability to an event, 
there must be restrictions on how the assignment can be made. For example, if we set
$P[\{H\}] = 1/3$, then $P[\{T\}]$ must be 2/3; otherwise, the probabilities of getting a head and a tail
will not sum to 1. The necessary restrictions on assigning a probability to an event are
collectively known as the axioms of probability.


Definition 2.21. A probability law is a function $P : \mathcal{F} \to [0,1]$ that maps an event
$A$ to a real number in [0, 1]. The function must satisfy the axioms of probability:


I. Non-negativity: $P[A] \ge 0$, for any $A \subseteq \Omega$.


II. Normalization: $P[\Omega] = 1$.

III. Additivity: For any disjoint sets $\{A_1, A_2, \ldots\}$, it must be true that

$$P\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} P[A_i].$$


An axiom is a proposition that serves as a premise or starting point in a logical system.
Axioms are not definitions, nor are they theorems. They are believed to be true or true
within a certain context. In our case, the three axioms above are the ones usually attributed to
Kolmogorov, and they are true within the context of measure-based probability. Other axiomatic
systems of probability exist. We will not dive
into the details of these historical issues; in this book, we will confine our discussion to the
three axioms given above.


2.3.1 Why these three probability axioms? 
Why do we need three axioms? Why not just two axioms? Why these three particular
axioms? The reasons are summarized in the box below.


Why these three axioms? 


• Axiom I (Non-negativity) ensures that probability is never negative.


• Axiom II (Normalization) ensures that probability is never greater than 1.


• Axiom III (Additivity) allows us to add probabilities when two events do not
overlap.


Axiom I is called the non-negativity axiom. It ensures that a probability value cannot
be negative. Non-negativity is a must for probability. It is meaningless to say that the 
probability of getting an event is a negative number. 


Axiom II is called the normalization axiom. It ensures that the probability of observing 
all possible outcomes is 1. This gives the upper limit of the probability. The upper limit 
does not have to be 1. It could be 10 or 100. As long as we are consistent about this upper 
limit, we are good. However, for historical reasons and convenience, we choose 1 to be the 
upper limit. 


Axiom III is called the additivity axiom and is the most critical one among the three. 
The additivity axiom defines how set operations can be translated into probability oper- 
ations. In a nutshell, it says that if we have a set of disjoint events, the probabilities can 
be added. From the measure perspective, Axiom III makes sense because if P measures the 
size of an event, then two disjoint events should have their probabilities added. If two dis- 
joint events do not allow their probabilities to be added, then there is no way to measure 
a combined event. Similarly, if the probabilities can somehow be added even for overlap- 
ping events, there will be inconsistencies because there is no systematic way to handle the 
overlapping regions. 


The countable additivity stated in Axiom III applies to either a finite or
an infinite number of sets. The finite case states that for any two disjoint sets $A$ and $B$,
we have

$$P[A \cup B] = P[A] + P[B]. \tag{2.24}$$

In other words, if $A$ and $B$ are disjoint, then the probability of observing either $A$ or $B$ is
the sum of the two individual probabilities. Figure 2.22 illustrates this idea.


Example 2.39. Let’s see why Axiom III is critical. Consider throwing a fair die with
$\Omega = \{1,2,3,4,5,6\}$. The probability of getting $\{4, 6\}$ is

$$P[\{4,6\}] = P[\{4\} \cup \{6\}] = P[\{4\}] + P[\{6\}] = \frac{1}{6} + \frac{1}{6} = \frac{2}{6}.$$

In this equation, the second equality holds because the events $\{4\}$ and $\{6\}$ are disjoint.
If we do not have Axiom III, then we cannot add probabilities.


Figure 2.22: Axiom III says $P[A \cup B] = P[A] + P[B]$ if $A \cap B = \emptyset$.


2.3.2 Axioms through the lens of measure 


Axioms are “rules” we must abide by when we construct a measure. Therefore, any valid 
measure must be compatible with the axioms, regardless of whether we have a weighting 
function or not. In the following two examples, we will see how the weighting functions are 
used in the axioms. 


Example 2.40. Consider a sample space with Q = {&, 9,44}. The probability for 
each outcome is 


PH@H= 5, PHOW= 5, PIO] = 5. 


Suppose we construct two disjoint events EF, = {&,0} and £2 = {4}. Then Axiom 
III says 


P[E, U Ey] = P[E,] + PE] = ( rs z) a : =1, 


Note that in this calculation, the measure P is still a measure P. If we endow it 
with a nonuniform weight function, then P applies the corresponding weights to the 
corresponding outcomes. This process is compatible with the axioms. See Figure 2.23 
for a pictorial illustration. 




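Example 2.41. Suppose the sample space is an interval $\Omega = [0,1]$. The two events
are $E_1 = [a,b]$ and $E_2 = [c,d]$, which we take to be disjoint ($b < c$, as drawn in Figure 2.24).
Assume that the measure $P$ uses a weighting function
$f(x)$. Then, by Axiom III, we know that

$$\begin{aligned}
P[E_1 \cup E_2] &= P[E_1] + P[E_2] && \text{(by Axiom III)} \\
&= P[[a,b]] + P[[c,d]] \\
&= \int_a^b f(x)\,dx + \int_c^d f(x)\,dx && \text{(apply the measure).}
\end{aligned}$$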


As you can see, there is no conflict between the axioms and the measure. Figure 2.24 
illustrates this example. 
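
The compatibility claimed in Example 2.41 can also be checked numerically. The sketch below is only an illustration under assumptions that are not in the text: the weighting function $f(x) = 2x$, the endpoints $a, b, c, d$, and the helper name `measure` are all hypothetical choices. It integrates $f$ over the union directly and compares the result with the sum of the two separate integrals.

```python
def f(x):
    """Hypothetical weighting function on [0, 1]; it integrates to 1."""
    return 2.0 * x


def measure(indicator, n=1000):
    """Midpoint-rule estimate of the integral of f over {x in [0, 1] : indicator(x)}."""
    h = 1.0 / n
    midpoints = ((k + 0.5) * h for k in range(n))
    return sum(f(x) for x in midpoints if indicator(x)) * h


# Hypothetical disjoint intervals E1 = [a, b] and E2 = [c, d] with b < c.
a, b, c, d = 0.1, 0.3, 0.5, 0.9


def in_E1(x):
    return a <= x <= b


def in_E2(x):
    return c <= x <= d


p1 = measure(in_E1)                                   # P[E1]
p2 = measure(in_E2)                                   # P[E2]
p_union = measure(lambda x: in_E1(x) or in_E2(x))     # P[E1 ∪ E2], integrated directly

print(p1, p2, p_union)  # p_union agrees with p1 + p2 up to rounding
```

Because $f$ is linear and the interval endpoints fall on grid boundaries, the midpoint rule is exact here up to floating-point rounding, so any disagreement would indicate a real violation of additivity rather than discretization error.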



Figure 2.23: Applying weighting functions to the measures: Suppose we have three elements in the set. 
To compute the probability P[{U, 44} U {de}], we can write it as the sum of P[{V, 4}] and P[{d&}]. 




Figure 2.24: The axioms are compatible with the measure, even if we use a weighting function. 


2.3.3 Corollaries derived from the axioms 


The union of A and B is equivalent to the logical operator “OR”. Once the logical operation 
“OR” is defined, all other logical operations can be defined. The following corollaries are 
examples. 


Corollary 2.1. Let $A \in \mathcal{F}$ be an event. Then,

(a) $P[A^c] = 1 - P[A]$.

(b) $P[A] \le 1$.

(c) $P[\emptyset] = 0$.

Proof. (a) Since $\Omega = A \cup A^c$ and the sets $A$ and $A^c$ are disjoint, by finite additivity we have
$P[\Omega] = P[A \cup A^c] = P[A] + P[A^c]$.
By the normalization axiom, we have $P[\Omega] = 1$. Therefore, $P[A^c] = 1 - P[A]$.


(b) We prove by contradiction. Assume $P[A] > 1$. Consider the complement $A^c$ where
$A \cup A^c = \Omega$. Since $P[A^c] = 1 - P[A]$, we must have $P[A^c] < 0$ because by hypothesis $P[A] > 1$.
But $P[A^c] < 0$ violates the non-negativity axiom. So we must have $P[A] \le 1$.


(c) Since $\Omega = \Omega \cup \emptyset$, the empty set is the complement of $\Omega$. By part (a) we have
$P[\emptyset] = 1 - P[\Omega] = 0$.


Corollary 2.2 (Unions of Two Non-Disjoint Sets). For any $A$ and $B$ in $\mathcal{F}$,

$$P[A \cup B] = P[A] + P[B] - P[A \cap B]. \tag{2.25}$$

This statement is different from Axiom III because $A$ and $B$ are not necessarily disjoint.


Figure 2.25: For any $A$ and $B$, $P[A \cup B] = P[A] + P[B] - P[A \cap B]$.


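Proof. First, observe that $A \cup B$ can be partitioned into three disjoint subsets as
$A \cup B = (A \setminus B) \cup (A \cap B) \cup (B \setminus A)$. Since $A \setminus B = A \cap B^c$ and $B \setminus A = B \cap A^c$, by finite additivity we
have that

$$\begin{aligned}
P[A \cup B] &= P[A \setminus B] + P[A \cap B] + P[B \setminus A] = P[A \cap B^c] + P[A \cap B] + P[B \cap A^c] \\
&\overset{(a)}{=} P[A \cap B^c] + P[A \cap B] + P[B \cap A^c] + P[A \cap B] - P[A \cap B] \\
&\overset{(b)}{=} P[A \cap (B^c \cup B)] + P[(A^c \cup A) \cap B] - P[A \cap B] \\
&= P[A \cap \Omega] + P[\Omega \cap B] - P[A \cap B] = P[A] + P[B] - P[A \cap B],
\end{aligned}$$

where in (a) we added and subtracted a term $P[A \cap B]$, and in (b) we used finite additivity
so that $P[A \cap B^c] + P[A \cap B] = P[(A \cap B^c) \cup (A \cap B)] = P[A \cap (B^c \cup B)]$.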


Example 2.42. The corollary is easy to understand if we consider the following example.
Let $\Omega = \{1,2,3,4,5,6\}$ be the sample space of a fair die. Let $A = \{1,2,3\}$ and
$B = \{3,4,5\}$. Then

$$P[A \cup B] = P[\{1,2,3,4,5\}] = \frac{5}{6}.$$

We can also use the corollary to obtain the same result:

$$\begin{aligned}
P[A \cup B] &= P[A] + P[B] - P[A \cap B] \\
&= P[\{1,2,3\}] + P[\{3,4,5\}] - P[\{3\}] \\
&= \frac{3}{6} + \frac{3}{6} - \frac{1}{6} = \frac{5}{6}.
\end{aligned}$$

Corollary 2.3 (Inequalities). Let $A$ and $B$ be two events in $\mathcal{F}$. Then,

(a) $P[A \cup B] \le P[A] + P[B]$. (Union Bound)

(b) If $A \subseteq B$, then $P[A] \le P[B]$.


Proof. (a) Since $P[A \cup B] = P[A] + P[B] - P[A \cap B]$ and by the non-negativity axiom $P[A \cap B] \ge 0$,
we must have $P[A \cup B] \le P[A] + P[B]$. (b) If $A \subseteq B$, then there exists a set $B \setminus A$ such that
$B = A \cup (B \setminus A)$, where $A$ and $B \setminus A$ are disjoint. Therefore, by finite additivity we have
$P[B] = P[A] + P[B \setminus A]$. Since
$P[B \setminus A] \ge 0$, it follows that $P[A] + P[B \setminus A] \ge P[A]$. Thus we have $P[B] \ge P[A]$.


Union bound is a frequently used tool for analyzing probabilities when the intersection
$A \cap B$ is difficult to evaluate. Part (b) is useful when considering two events of different
“sizes.” For example, in the bus-waiting example, if we let $A = \{t \le 5\}$, and $B = \{t \le 10\}$,
then $P[A] \le P[B]$ because we have to wait for the first 5 minutes to go into the remaining
5 minutes.


Practice Exercise 2.13. Let the events $A$ and $B$ have $P[A] = x$, $P[B] = y$ and
$P[A \cup B] = z$. Find the following probabilities: $P[A \cap B]$, $P[A^c \cup B^c]$, and $P[A \cap B^c]$.


Solution.
(a) Note that $z = P[A \cup B] = P[A] + P[B] - P[A \cap B]$. Thus, $P[A \cap B] = x + y - z$.
(b) We can take the complement to obtain the result:

$$P[A^c \cup B^c] = 1 - P[(A^c \cup B^c)^c] = 1 - P[A \cap B] = 1 - x - y + z.$$

(c) $P[A \cap B^c] = P[A] - P[A \cap B] = x - (x + y - z) = z - y$.


Practice Exercise 2.14. Consider a sample space 


$$\Omega = \{f : \mathbb{R} \to \mathbb{R} \mid f(x) = ax, \text{ for all } a \in \mathbb{R}, x \in \mathbb{R}\}.$$

There are two events: $A = \{f \mid f(x) = ax, a \ge 0\}$, and $B = \{f \mid f(x) = ax, a \le 0\}$.
So, basically, $A$ is the set of all straight lines with non-negative slope, and $B$ is the set of
straight lines with non-positive slope. Show that the union bound is tight.

Solution. First of all, we note that

$$P[A \cup B] = P[A] + P[B] - P[A \cap B].$$

The intersection is

$$P[A \cap B] = P[\{f \mid f(x) = 0\}].$$

Since this corresponds to the single point $a = 0$ on the real line, it has measure zero. Thus, $P[A \cap B] = 0$ and
hence $P[A \cup B] = P[A] + P[B]$. So the union bound is tight.


Closing remark. The development of today’s probability theory is generally credited to 
Andrey Kolmogorov’s 1933 book Foundations of the Theory of Probability. We close this 
section by citing one of the tables of the book. The table summarizes the correspondence 
between set theory and random events. 


| Theory of sets | Random events |
|---|---|
| $A$ and $B$ are disjoint, i.e., $A \cap B = \emptyset$ | Events $A$ and $B$ are incompatible |
| $A_1 \cap A_2 \cap \cdots \cap A_N = \emptyset$ | Events $A_1, \ldots, A_N$ are incompatible |
| $A_1 \cap A_2 \cap \cdots \cap A_N = X$ | Event $X$ is defined as the simultaneous occurrence of events $A_1, \ldots, A_N$ |
| $A_1 \cup A_2 \cup \cdots \cup A_N = X$ | Event $X$ is defined as the occurrence of at least one of the events $A_1, \ldots, A_N$ |
| $A^c$ | The opposite event $A^c$ consisting of the non-occurrence of event $A$ |
| $A = \emptyset$ | Event $A$ is impossible |
| $A = \Omega$ | Event $A$ must occur |
| $A_1, \ldots, A_N$ form a partition of $\Omega$ | The experiment consists of determining which of the events $A_1, \ldots, A_N$ occurs |
| $B \subset A$ | From the occurrence of event $B$ follows the inevitable occurrence of $A$ |

Table 2.2: Kolmogorov’s summary of set theory results and random events.


2.4 Conditional Probability 


In many practical data science problems, we are interested in the relationship between two 
or more events. For example, an event A may cause B to happen, and B may cause C 
to happen. A legitimate question in probability is then: If A has happened, what is the 
probability that B also happens? Of course, if A and B are correlated events, then knowing 
one event can tell us something about the other event. If the two events have no relationship, 
knowing one event will not tell us anything about the other. 

In this section, we study the concept of conditional probability. There are three subtopics
in this section. We summarize the key points below.

The three main messages of this section are:


• Section 2.4.1: Conditional probability. Conditional probability of $A$ given $B$ is
$P[A \mid B] = \frac{P[A \cap B]}{P[B]}$.

• Section 2.4.2: Independence. Two events are independent if the occurrence of
one does not influence the occurrence of the other: $P[A \mid B] = P[A]$.

• Section 2.4.3: Bayes’ theorem and the law of total probability. Bayes’ theorem
allows us to switch the order of the conditioning: $P[A \mid B]$ vs. $P[B \mid A]$, whereas the
law of total probability allows us to decompose an event into smaller events.


2.4.1 Definition of conditional probability 
We start by defining conditional probability. 


Definition 2.22. Consider two events $A$ and $B$. Assume $P[B] \ne 0$. The conditional
probability of $A$ given $B$ is

$$P[A \mid B] = \frac{P[A \cap B]}{P[B]}. \tag{2.26}$$


According to this definition, the conditional probability of $A$ given $B$ is the ratio of
$P[A \cap B]$ to $P[B]$. It is the probability that $A$ happens when we know that $B$ has already
happened. Since $B$ has already happened, the event that $A$ has also happened is represented
by $A \cap B$. However, since we are only interested in the relative probability of $A$ with respect
to $B$, we need to normalize using $B$. This can be seen by comparing $P[A \mid B]$ and $P[A \cap B]$:

$$P[A \mid B] = \frac{P[A \cap B]}{P[B]} \quad \text{and} \quad P[A \cap B] = \frac{P[A \cap B]}{P[\Omega]}. \tag{2.27}$$

The difference is illustrated in Figure 2.26: The intersection $P[A \cap B]$ calculates the overlapping
area of the two events. We make no assumptions about the cause-effect relationship.


Figure 2.26: Illustration of conditional probability and its comparison with $P[A \cap B]$.


What justifies this ratio? Suppose that $B$ has already happened. Then, anything outside
$B$ will immediately become irrelevant as far as the relationship between $A$ and $B$ is
concerned. So when we ask: “What is the probability that $A$ happens given that $B$ has
happened?”, we are effectively asking for the probability that $A \cap B$ happens under the
condition that $B$ has happened. Note that we need to consider $A \cap B$ because we know
that $B$ has already happened. If we take $A$ only, then there exists a region $A \setminus B$ which
does not contain anything about $B$. However, since we know that $B$ has happened, $A \setminus B$ is
impossible. In other words, among the elements of $A$, only those that appear in $A \cap B$ are
meaningful.


Example 2.43. Let 


A = {Purdue gets Big Ten championship}, 
B = {Purdue wins 15 games consecutively}. 


In this example,

$$P[A \mid B] = \text{the probability that Purdue gets the championship given that Purdue won 15 games.}$$

Without knowing anything about $B$, it is unlikely that Purdue will get
the championship because the sample space of all possible competition results is large.
However, if we have already won 15 games consecutively, then the denominator of the
probability becomes much smaller. In this case, the conditional probability is high.


Example 2.44. Consider throwing a die. Let
$A = \{\text{getting a 3}\}$ and $B = \{\text{getting an odd number}\}$.


Find $P[A \mid B]$ and $P[B \mid A]$.


Solution. The following probabilities are easy to calculate:

$$P[A] = P[\{3\}] = \frac{1}{6}, \quad \text{and} \quad P[B] = P[\{1,3,5\}] = \frac{3}{6}.$$

Also, the intersection is

$$P[A \cap B] = P[\{3\}] = \frac{1}{6}.$$

Given these values, the conditional probability of $A$ given $B$ can be calculated as

$$P[A \mid B] = \frac{P[A \cap B]}{P[B]} = \frac{1/6}{3/6} = \frac{1}{3}.$$

In other words, if we know that we have an odd number, then the probability of
obtaining a 3 has to be computed over $\{1,3,5\}$, which gives us a probability of $\frac{1}{3}$. If we
do not know that we have an odd number, then the probability of obtaining a 3 has
to be computed from the sample space $\{1,2,3,4,5,6\}$, which will give us $\frac{1}{6}$.

The other conditional probability is

$$P[B \mid A] = \frac{P[A \cap B]}{P[A]} = \frac{1/6}{1/6} = 1.$$

Therefore, if we know that we have rolled a 3, then the probability for this number
being an odd number is 1.
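
Definition 2.22 translates almost line by line into code. The sketch below assumes a fair die and uses a counting measure; the function names `prob` and `cond_prob` are hypothetical helpers introduced here for illustration. It reproduces the two conditional probabilities of Example 2.44 and refuses to condition on an event of probability zero.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # fair six-sided die (assumed)


def prob(event):
    """P[event] under the counting measure on OMEGA."""
    return Fraction(len(event & OMEGA), len(OMEGA))


def cond_prob(event, given):
    """P[event | given] = P[event ∩ given] / P[given], as in Definition 2.22."""
    if prob(given) == 0:
        raise ValueError("the conditioning event must have positive probability")
    return prob(event & given) / prob(given)


A = frozenset({3})          # getting a 3
B = frozenset({1, 3, 5})    # getting an odd number

print(cond_prob(A, B))  # 1/3
print(cond_prob(B, A))  # 1
```

Note that `cond_prob(A, B)` and `cond_prob(B, A)` share the same numerator and differ only in the normalizing denominator, which is exactly the asymmetry the example is meant to highlight.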


Example 2.45. Consider the situation shown in Figure 2.27. There are 12 points 
with equal probabilities of happening. Find the probabilities P[A|B] and P[B|A]. 


Solution. In this example, we can first calculate the individual probabilities: 


2 
Pi] =, and P[ |=— and P[AN Bl = 55: 


Then the conditional probabilities are 


P[A|B] = 


P[B|A] = 


Figure 2.27: Visualization of Example 2.45: [Left] All the sets. [Middle] P(A|B) is the ratio between 
dots inside the light yellow region over those in yellow, which is 2. [Right] P[A|B] is the ratio between 
dots inside the light pink region over those in pink, which is 2. 


Example 2.46. Consider a tetrahedral (4-sided) die. Let $X$ be the first roll and $Y$
be the second roll. Let $B$ be the event that $\min(X,Y) = 2$ and $M$ be the event that
$\max(X,Y) = 3$. Find $P[M \mid B]$.


Solution. As shown in Figure 2.28, the event $B$ is highlighted in green. (Why?)
Similarly, the event $M$ is highlighted in blue. (Again, why?) Of the 16 equally likely outcomes,
5 belong to $B$ and 2 of those also belong to $M$. Therefore, the probability is

$$P[M \mid B] = \frac{P[M \cap B]}{P[B]} = \frac{2/16}{5/16} = \frac{2}{5}.$$

Figure 2.28: Visualization of Example 2.46. [Left] Event $B$. [Middle] Event $M$. [Right] $P[M \mid B]$ is the
ratio of the number of blue squares inside the green region to the total number of green squares, which
is $\frac{2}{5}$.
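
For readers who want to answer the two "Why?" questions by brute force, the grid in Figure 2.28 can be enumerated directly. The following sketch assumes a fair 4-sided die and two independent rolls, so all 16 ordered pairs are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely ordered pairs (X, Y) for two rolls of a 4-sided die.
outcomes = list(product(range(1, 5), repeat=2))

B = {(x, y) for x, y in outcomes if min(x, y) == 2}   # event: min(X, Y) = 2
M = {(x, y) for x, y in outcomes if max(x, y) == 3}   # event: max(X, Y) = 3

# With equally likely outcomes, P[M | B] reduces to a ratio of counts.
p_M_given_B = Fraction(len(M & B), len(B))

print(sorted(B))       # the five green squares
print(sorted(M & B))   # the two blue squares inside the green region
print(p_M_given_B)     # 2/5
```

Since every outcome has the same probability $\frac{1}{16}$, the factor cancels in the ratio $P[M \cap B]/P[B]$, which is why counting squares is enough.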


Remark. Notice that since $P[B] \le P[\Omega] = 1$, $P[A \mid B]$ is always larger than or equal to $P[A \cap B]$,
i.e.,
$$P[A \mid B] \ge P[A \cap B].$$


Conditional probabilities are legitimate probabilities 


Conditional probabilities are legitimate probabilities. That is, given $B$, the probability
$P[A \mid B]$ satisfies Axioms I, II, and III.


Theorem 2.6. Let $P[B] > 0$. The conditional probability $P[A \mid B]$ satisfies Axioms I,
II, and III.


Proof. Let’s check the axioms: 


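• Axiom I: We want to show

$$P[A \mid B] = \frac{P[A \cap B]}{P[B]} \ge 0.$$

Since $P[B] > 0$ and Axiom I requires $P[A \cap B] \ge 0$, we therefore have $P[A \mid B] \ge 0$.


• Axiom II:

$$P[\Omega \mid B] = \frac{P[\Omega \cap B]}{P[B]} = \frac{P[B]}{P[B]} = 1.$$

• Axiom III: Consider two disjoint sets $A$ and $C$. Then,

$$\begin{aligned}
P[A \cup C \mid B] &= \frac{P[(A \cup C) \cap B]}{P[B]} = \frac{P[(A \cap B) \cup (C \cap B)]}{P[B]} \\
&\overset{(a)}{=} \frac{P[A \cap B]}{P[B]} + \frac{P[C \cap B]}{P[B]} \\
&= P[A \mid B] + P[C \mid B],
\end{aligned}$$

where (a) holds because if $A$ and $C$ are disjoint then $(A \cap B) \cap (C \cap B) = \emptyset$.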


To summarize this subsection, we highlight the essence of conditional probability. 


What are conditional probabilities? 


• Conditional probability of $A$ given $B$ is the ratio $\frac{P[A \cap B]}{P[B]}$.


• It is again a measure. It measures the relative size of $A$ inside $B$.


• Because it is a measure, it must satisfy the three axioms.


2.4.2 Independence 


Conditional probability deals with situations where two events A and B are related. What 
if the two events are unrelated? In probability, we have a technical term for this situation: 
statistical independence. 


Definition 2.23. Two events A and B are statistically independent if 


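$$P[A \cap B] = P[A]\,P[B].$$


Why define independence in this way? Recall that $P[A \mid B] = \frac{P[A \cap B]}{P[B]}$. If $A$ and $B$ are
independent, then $P[A \cap B] = P[A]\,P[B]$ and so

$$P[A \mid B] = \frac{P[A \cap B]}{P[B]} = \frac{P[A]\,P[B]}{P[B]} = P[A]. \tag{2.29}$$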


This suggests an interpretation of independence: If the occurrence of B provides no addi- 
tional information about the occurrence of A, then A and B are independent. 
Therefore, we can define independence via conditional probability: 


Definition 2.24. Let $A$ and $B$ be two events such that $P[A] > 0$ and $P[B] > 0$. Then
$A$ and $B$ are independent if

$$P[A \mid B] = P[A] \quad \text{or} \quad P[B \mid A] = P[B].$$

The two statements are equivalent as long as $P[A] > 0$ and $P[B] > 0$. This is because
$P[A \mid B] = P[A \cap B]/P[B]$. If $P[A \mid B] = P[A]$ then $P[A \cap B] = P[A]\,P[B]$, which implies that
$P[B \mid A] = P[A \cap B]/P[A] = P[B]$.

A pictorial illustration of independence is given in Figure 2.29. The key message is that
if two events $A$ and $B$ are independent, then $P[A \mid B] = P[A]$. The conditional probability
$P[A \mid B]$ is the ratio of $P[A \cap B]$ over $P[B]$, which is the intersection over $B$ (the blue set).
The probability $P[A]$ is the yellow set over the sample space $\Omega$.


Figure 2.29: Independence means that the conditional probability $P[A \mid B]$ is the same as $P[A]$. This
implies that the ratio of $P[A \cap B]$ over $P[B]$, and the ratio of $P[A \cap \Omega]$ over $P[\Omega]$ are the same.


Disjoint versus independent 


Disjoint $\not\Leftrightarrow$ Independent.


The statement says that disjoint and independent are two completely different concepts.

If $A$ and $B$ are disjoint, then $A \cap B = \emptyset$. This only implies that $P[A \cap B] = 0$.
However, it says nothing about whether $P[A \cap B]$ can be factorized into $P[A]\,P[B]$. If $A$
and $B$ are independent, then we have $P[A \cap B] = P[A]\,P[B]$. But this does not imply that
$P[A \cap B] = 0$. The only condition under which Disjoint $\Leftrightarrow$ Independence is when $P[A] = 0$ or
$P[B] = 0$. Figure 2.30 depicts the situation. When two sets are independent, the conditional
probability (which is a ratio) remains unchanged compared to unconditioned probability.
When two sets are disjoint, they simply do not overlap.


Practice Exercise 2.15. Throw a die twice. Are $A$ and $B$ independent, where
$A = \{\text{1st die is 3}\}$ and $B = \{\text{2nd die is 4}\}$?


Solution. We can show that

$$P[A \cap B] = P[\{(3,4)\}] = \frac{1}{36}, \quad P[A] = \frac{1}{6}, \quad \text{and} \quad P[B] = \frac{1}{6}.$$

So $P[A \cap B] = P[A]\,P[B]$. Thus, $A$ and $B$ are independent.


Figure 2.30: Independent means that the conditional probability, which is a ratio, is the same as the
unconditioned probability. Disjoint means that the two sets do not overlap.


Figure 2.31: The two events $A$ and $B$ are independent because $P[A] = \frac{1}{6}$ and $P[A \mid B] = \frac{1}{6}$.


A pictorial illustration of this example is shown in Figure 2.31. The two events are
independent because $A$ is one row in the 2D space, which yields a probability of $\frac{1}{6}$. The
conditional probability $P[A \mid B]$ is the coordinate (3,4) over the event $B$, which is a column.
It happens that $P[A \mid B] = \frac{1}{6}$. Thus, the two events are independent.

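Practice Exercise 2.16. Throw a die twice. Are $A$ and $B$ independent?


$A = \{\text{1st die is 3}\}$ and $B = \{\text{sum is 7}\}$.


Solution. Note that

$$P[A \cap B] = P[\{(3,4)\}] = \frac{1}{36}, \quad P[A] = \frac{1}{6}, \quad \text{and} \quad P[B] = P[\{(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\}] = \frac{1}{6}.$$

So $P[A \cap B] = P[A]\,P[B]$. Thus, $A$ and $B$ are independent.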


A pictorial illustration of this example is shown in Figure 2.32. Notice that whether the
two events intersect is not how we determine independence (that only determines disjoint or
not). The key is whether the conditional probability (which is the ratio) remains unchanged
compared to the unconditioned probability.


Figure 2.32: The two events $A$ and $B$ are independent because $P[A] = \frac{1}{6}$ and $P[A \mid B] = \frac{1}{6}$.


If we let $B = \{\text{sum is 8}\}$, then the situation is different. The intersection $A \cap B$ has a
probability $\frac{1}{5}$ relative to $B$, and therefore $P[A \mid B] = \frac{1}{5} \ne P[A] = \frac{1}{6}$. Hence, the two events $A$ and $B$ are
dependent. If you like a more intuitive argument, you can imagine that $B$ has happened,
i.e., the sum is 8. Then the probability for the first die to be 1 is 0 because there is no way
to construct 8 when the first die is 1. As a result, we have eliminated one choice for the first
die, leaving only five options. Therefore, since $B$ has influenced the probability of $A$, they
are dependent.


Practice Exercise 2.17. Throw a die twice. Let
$A = \{\text{max is 2}\}$ and $B = \{\text{min is 2}\}$.
Are $A$ and $B$ independent?

Solution. Let us first list out $A$ and $B$:

$$A = \{(1,2), (2,1), (2,2)\},$$
$$B = \{(2,2), (2,3), (2,4), (2,5), (2,6), (3,2), (4,2), (5,2), (6,2)\}.$$

Therefore, the probabilities are

$$P[A] = \frac{3}{36}, \quad P[B] = \frac{9}{36}, \quad \text{and} \quad P[A \cap B] = P[\{(2,2)\}] = \frac{1}{36}.$$

Clearly, $P[A \cap B] \ne P[A]\,P[B]$ and so $A$ and $B$ are dependent.
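
The independence checks in Practice Exercises 2.15 to 2.17 all follow the same recipe: list the outcomes, count, and compare $P[A \cap B]$ with $P[A]\,P[B]$. A minimal sketch of that recipe is given below, assuming two fair six-sided dice; the helpers `prob`, `event` and `independent` are hypothetical names chosen for this illustration.

```python
from fractions import Fraction
from itertools import product

OMEGA = set(product(range(1, 7), repeat=2))  # 36 equally likely ordered pairs


def prob(ev):
    """P[ev] under the uniform measure on OMEGA."""
    return Fraction(len(ev), len(OMEGA))


def event(pred):
    """Collect the outcomes (x, y) for which the predicate holds."""
    return {w for w in OMEGA if pred(*w)}


def independent(A, B):
    """Definition 2.23: A and B are independent iff P[A ∩ B] = P[A] P[B]."""
    return prob(A & B) == prob(A) * prob(B)


first_is_3 = event(lambda x, y: x == 3)
second_is_4 = event(lambda x, y: y == 4)
sum_is_7 = event(lambda x, y: x + y == 7)
sum_is_8 = event(lambda x, y: x + y == 8)
max_is_2 = event(lambda x, y: max(x, y) == 2)
min_is_2 = event(lambda x, y: min(x, y) == 2)

print(independent(first_is_3, second_is_4))  # True  (Practice Exercise 2.15)
print(independent(first_is_3, sum_is_7))     # True  (Practice Exercise 2.16)
print(independent(first_is_3, sum_is_8))     # False (the sum-is-8 variant)
print(independent(max_is_2, min_is_2))       # False (Practice Exercise 2.17)
```

Exact fractions are used on purpose: with floating-point numbers the equality test in `independent` could fail because of rounding even when the events are independent.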


What is independence? 


• Two events are independent when the ratio $P[A \cap B]/P[B]$ remains unchanged
compared to $P[A]$.


• Independence $\ne$ disjoint.


2.4.3 Bayes’ theorem and the law of total probability


Theorem 2.7 (Bayes’ theorem). For any two events $A$ and $B$ such that $P[A] > 0$
and $P[B] > 0$,

$$P[A \mid B] = \frac{P[B \mid A]\,P[A]}{P[B]}.$$

Proof. By the definition of conditional probabilities, we have

$$P[A \mid B] = \frac{P[A \cap B]}{P[B]} \quad \text{and} \quad P[B \mid A] = \frac{P[B \cap A]}{P[A]}.$$

Rearranging the terms yields

$$P[A \mid B]\,P[B] = P[B \mid A]\,P[A],$$

which gives the desired result by dividing both sides by $P[B]$.


Bayes’ theorem provides two views of the intersection $P[A \cap B]$ using two different conditional
probabilities. We call $P[B \mid A]$ the conditional probability and $P[A \mid B]$ the posterior
probability. The order of $A$ and $B$ is arbitrary. We can also call $P[A \mid B]$ the conditional
probability and $P[B \mid A]$ the posterior probability. The context of the problem will make this
clear.

Bayes’ theorem provides a way to switch P[A|B] and P[B|A]. The next theorem helps 
us decompose an event into smaller events. 


Theorem 2.8 (Law of Total Probability). Let $\{A_1, \ldots, A_n\}$ be a partition of $\Omega$, i.e.,
$A_1, \ldots, A_n$ are disjoint and $\Omega = A_1 \cup \cdots \cup A_n$. Then, for any $B \subseteq \Omega$,

$$P[B] = \sum_{i=1}^{n} P[B \mid A_i]\,P[A_i]. \tag{2.32}$$


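Proof. We start from the right-hand side.

$$\begin{aligned}
\sum_{i=1}^{n} P[B \mid A_i]\,P[A_i] &\overset{(a)}{=} \sum_{i=1}^{n} P[B \cap A_i] \overset{(b)}{=} P\left[\bigcup_{i=1}^{n} (B \cap A_i)\right] \\
&\overset{(c)}{=} P\left[B \cap \left(\bigcup_{i=1}^{n} A_i\right)\right] \overset{(d)}{=} P[B \cap \Omega] = P[B],
\end{aligned}$$

where (a) follows from the definition of conditional probability, (b) is due to Axiom III, (c)
holds because of the distributive property of sets, and (d) results from the partition property
of $\{A_1, A_2, \ldots, A_n\}$.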


Interpretation. The law of total probability can be understood as follows. If the sample
space $\Omega$ consists of disjoint subsets $A_1, \ldots, A_n$, we can compute the probability $P[B]$ by
summing over its portions $P[B \cap A_1], \ldots, P[B \cap A_n]$. However, each intersection can be written
as

$$P[B \cap A_i] = P[B \mid A_i]\,P[A_i]. \tag{2.33}$$

In other words, we write $P[B \cap A_i]$ as the conditional probability $P[B \mid A_i]$ times the prior
probability $P[A_i]$. When we sum all these intersections, we obtain the overall probability.
See Figure 2.33 for a graphical portrayal.


Figure 2.33: The law of total probability decomposes the probability $P[B]$ into multiple conditional
probabilities $P[B \mid A_i]$. The probability of obtaining each $P[B \mid A_i]$ is $P[A_i]$.


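Corollary 2.4. Let $\{A_1, A_2, \ldots, A_n\}$ be a partition of $\Omega$, i.e., $A_1, \ldots, A_n$ are disjoint
and $\Omega = A_1 \cup A_2 \cup \cdots \cup A_n$. Then, for any $B \subseteq \Omega$ with $P[B] > 0$,

$$P[A_j \mid B] = \frac{P[B \mid A_j]\,P[A_j]}{\sum_{i=1}^{n} P[B \mid A_i]\,P[A_i]}. \tag{2.34}$$

Proof. The result follows directly from Bayes’ theorem and the law of total probability:

$$P[A_j \mid B] = \frac{P[B \mid A_j]\,P[A_j]}{P[B]} = \frac{P[B \mid A_j]\,P[A_j]}{\sum_{i=1}^{n} P[B \mid A_i]\,P[A_i]}.$$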


Example 2.47. Suppose there are three types of players in a tennis tournament: A, 
B, and C. Fifty percent of the contestants in the tournament are A players, 25% are 
B players, and 25% are C players. Your chance of beating the contestants depends on 
the class of the player, as follows: 


0.3 against an A player 
0.4 against a B player 
0.5 against a C player


If you play a match in this tournament, what is the probability of your winning the 
match? Supposing that you have won a match, what is the probability that you played 
against an A player? 


Solution. We first list all the known probabilities. We know from the percentage
of players that

$$P[A] = 0.5, \quad P[B] = 0.25, \quad P[C] = 0.25.$$

Now, let $W$ be the event that you win the match. Then the conditional probabilities
are defined as follows:

$$P[W \mid A] = 0.3, \quad P[W \mid B] = 0.4, \quad P[W \mid C] = 0.5.$$

Therefore, by the law of total probability, we can show that the probability of
winning the match is

$$\begin{aligned}
P[W] &= P[W \mid A]\,P[A] + P[W \mid B]\,P[B] + P[W \mid C]\,P[C] \\
&= (0.3)(0.5) + (0.4)(0.25) + (0.5)(0.25) = 0.375.
\end{aligned}$$

Given that you have won the match, the probability of $A$ given $W$ is

$$P[A \mid W] = \frac{P[W \mid A]\,P[A]}{P[W]} = \frac{(0.3)(0.5)}{0.375} = 0.4.$$
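
The two steps of Example 2.47 (total probability first, then Bayes' theorem) can be mirrored in a few lines of Python. This is a sketch only: the dictionary names `prior`, `win_given`, `joint` and `posterior` are illustrative, and exact fractions are used so the results match the hand calculation without rounding.

```python
from fractions import Fraction

# Prior probabilities of meeting each type of player (from the example).
prior = {"A": Fraction("0.5"), "B": Fraction("0.25"), "C": Fraction("0.25")}

# Conditional probabilities of winning against each type, P[W | type].
win_given = {"A": Fraction("0.3"), "B": Fraction("0.4"), "C": Fraction("0.5")}

# Each term P[W | type] P[type] is the joint probability P[W ∩ type].
joint = {k: win_given[k] * prior[k] for k in prior}

# Law of total probability: P[W] is the sum of the joint probabilities.
p_win = sum(joint.values())

# Bayes' theorem (Corollary 2.4): P[type | W] = P[W ∩ type] / P[W].
posterior = {k: joint[k] / p_win for k in prior}

print(float(p_win))            # 0.375
print(float(posterior["A"]))   # 0.4
```

Computing the whole `posterior` dictionary at once also makes it easy to confirm that the posterior probabilities over the partition sum to 1, as Theorem 2.6 requires.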


Example 2.48. Consider the communication channel shown below. The probability 
of sending a 1 is p and the probability of sending a 0 is 1 — p. Given that 1 is sent, the 
probability of receiving 1 is 1 — 7. Given that 0 is sent, the probability of receiving 0 
is 1 — e. Find the probability that a 1 has been correctly received. 


l-e 
l—p 0 


Solution. Define the events 


So = “Oissent”, and Ro = “0 is received”. 


S, = “Lis sent”, and R, = “1 is received”. 
+ 


Then, the probability that 1 is received is P[R_1]. However, P[R_1] ≠ 1 − η because 1 − η
is the conditional probability that 1 is received given that 1 is sent. It is possible that
we receive 1 as a result of an error when 0 is sent. Therefore, we need to account for
both cases, S_0 and S_1. Using the law of total probability we have

P[R_1] = P[R_1 | S_1] P[S_1] + P[R_1 | S_0] P[S_0]
       = (1 − η)p + ε(1 − p).


Now, suppose that we have received 1. What is the probability that 1 was originally
sent? This is asking for the posterior probability P[S_1 | R_1], which can be found
using Bayes’ theorem:

P[S_1 | R_1] = P[R_1 | S_1] P[S_1] / P[R_1] = (1 − η)p / ((1 − η)p + ε(1 − p)).
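To get a feel for the two formulas of Example 2.48, here is a small Python sketch. The function names and the numerical values of p, ε, and η are hypothetical, chosen only for illustration; they do not come from the text.

```python
def prob_receive_one(p, eps, eta):
    """P[R_1] via the law of total probability: (1 - eta) p + eps (1 - p)."""
    return (1 - eta) * p + eps * (1 - p)


def posterior_sent_one(p, eps, eta):
    """P[S_1 | R_1] via Bayes' theorem."""
    return (1 - eta) * p / prob_receive_one(p, eps, eta)


# Hypothetical channel parameters (assumptions, not values from the book).
p, eps, eta = 0.5, 0.1, 0.2

p_r1 = prob_receive_one(p, eps, eta)
p_s1_given_r1 = posterior_sent_one(p, eps, eta)
print("P[R_1]       =", p_r1)
print("P[S_1 | R_1] =", p_s1_given_r1)
```

With a noiseless channel (ε = η = 0) the posterior is exactly 1, and it decreases as ε grows, which matches the intuition that a received 1 becomes less trustworthy when 0s are often flipped.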


When do we need to use Bayes’ theorem and the law of total probability? 
• Bayes’ theorem switches the role of the conditioning, from P[A | B] to P[B | A].

Example:
P[win the game | play with A] and P[play with A | win the game].

• The law of total probability decomposes an event into smaller events.

Example:
P[win] = P[win | A] P[A] + P[win | B] P[B].


2.4.4 The Three Prisoners problem 


Now that you are familiar with the concepts of conditional probabilities, we would like to 
challenge you with the following problem, known as the Three Prisoners problem. If you 
understand how this problem can be resolved, you have mastered conditional probability. 

Once upon a time, there were three prisoners A, B, and C. One day, the king decided
to pardon two of them and sentence the last one, as in this figure: 




Figure 2.34: The Three Prisoners problem: The king says that he will pardon two prisoners and sentence 
one. 




One of the prisoners, prisoner A, heard the news and wanted to ask a friendly guard
about his situation. The guard was honest. He was allowed to tell prisoner A that prisoner B
would be pardoned or that prisoner C would be pardoned, but he could not tell A whether
he would be pardoned. Prisoner A thought about the problem, and he began to hesitate to
ask the guard. Based on his present state of knowledge, his probability of being pardoned
is 2/3. However, if he asks the guard, this probability will be reduced to 1/2 because the guard
would tell him that one of the two other prisoners would be pardoned, and would tell him
which one it would be. Prisoner A reasons that his chance of being pardoned would then
drop because there are now only two prisoners left who may be pardoned, as illustrated in
Figure 2.35:




Figure 2.35: The Three Prisoners problem: If you do not ask the guard, your chance of being released 
is 2/3. If you ask the guard, the guard will tell you which one of the other prisoners will be released. 
Your chance of being released apparently drops to 1/2. 


Should prisoner A ask the guard? What has gone wrong with his reasoning? This 
problem is tricky in the sense that the verbal argument of prisoner A seems flawless. If 
he asked the guard, indeed, the game would be reduced to two people. However, this does 
not seem correct, because regardless of what the guard says, the probability for A to be 
pardoned should remain unchanged. Let’s see how we can solve this puzzle. 

Let X_A, X_B, X_C be the events of sentencing prisoners A, B, C, respectively. Let G_B
be the event that the guard says that prisoner B is pardoned. Without doing anything,
we know that

P[X_A] = 1/3, P[X_B] = 1/3, P[X_C] = 1/3.

Conditioned on these events, we can compute the following conditional probabilities that
the guard says B is pardoned:

P[G_B | X_A] = 1/2, P[G_B | X_B] = 0, P[G_B | X_C] = 1.

Why are these the conditional probabilities? P[G_B | X_B] = 0 is quite straightforward. If the king
decides to sentence B, the guard has no way of saying that B will be pardoned. Therefore,
P[G_B | X_B] must be zero. P[G_B | X_C] = 1 is also not difficult. If the king decides to
sentence C, then the guard has no choice but to tell you that B will be pardoned, because the
guard cannot say anything about prisoner A. Finally, P[G_B | X_A] = 1/2 can be understood
as follows: If the king decides to sentence A, the guard can tell you either B or C. In other
words, the guard flips a coin.

With these conditional probabilities ready, we can determine the probability of interest. This is the
conditional probability P[X_A | G_B]. That is, supposing that the guard says B is pardoned,
what is the probability that A will be sentenced? This is the actual scenario that A is facing.
Solving for this conditional probability is not difficult. By Bayes’ theorem we know that

P[X_A | G_B] = P[G_B | X_A] P[X_A] / P[G_B],

and P[G_B] = P[G_B | X_A] P[X_A] + P[G_B | X_B] P[X_B] + P[G_B | X_C] P[X_C] according to the law of
total probability. Substituting the numbers into these equations, we have that

P[G_B] = P[G_B | X_A] P[X_A] + P[G_B | X_B] P[X_B] + P[G_B | X_C] P[X_C]
       = (1/2)(1/3) + (0)(1/3) + (1)(1/3) = 1/2,

P[X_A | G_B] = P[G_B | X_A] P[X_A] / P[G_B] = ((1/2)(1/3)) / (1/2) = 1/3.

Therefore, given that the guard says B is pardoned, the probability that A will be sentenced
remains 1/3. In fact, what you can show in this example is that P[X_A | G_B] = 1/3 = P[X_A].
Therefore, the presence or absence of the guard does not alter the probability. This is because
what the guard says is independent of whether prisoner A will be sentenced. The lesson
we learn from this problem is not to rely on verbal arguments. We need to write down the
conditional probabilities and spell out the steps.
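If the algebra still feels suspicious, a simulation can serve as a second opinion. The sketch below is our own illustration, not part of the original text. It assumes the king picks the sentenced prisoner uniformly at random and that the guard flips a fair coin when A is the one sentenced, exactly as in the derivation above. The function name and the number of trials are arbitrary choices.

```python
import random


def simulate_three_prisoners(trials, seed=0):
    """Estimate P[G_B] and P[X_A | G_B] by simulating the king and the guard."""
    rng = random.Random(seed)
    says_b = 0                  # number of trials in which the guard names B
    says_b_and_a_sentenced = 0  # ... and A is the sentenced prisoner
    for _ in range(trials):
        sentenced = rng.choice("ABC")      # the king's choice, uniform
        if sentenced == "A":
            named = rng.choice("BC")       # guard flips a fair coin
        elif sentenced == "B":
            named = "C"                    # guard cannot name A or B
        else:
            named = "B"                    # guard cannot name A or C
        if named == "B":
            says_b += 1
            if sentenced == "A":
                says_b_and_a_sentenced += 1
    return says_b / trials, says_b_and_a_sentenced / says_b


p_gb, p_xa_given_gb = simulate_three_prisoners(200_000)
print("P[G_B]       ~", p_gb)
print("P[X_A | G_B] ~", p_xa_given_gb)
```

The estimates should land near 1/2 and 1/3, respectively, up to simulation noise. Changing the guard's coin to a biased one is an instructive experiment: the independence between X_A and G_B then no longer holds.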




Figure 2.36: The Three Prisoners problem is resolved by noting that P[X_A | G_B] = P[X_A]. Therefore,
the events X_A and G_B are independent.


How to resolve the Three Prisoners problem? 


• The key is that G_A, G_B, G_C do not form a partition. See Figure 2.36.

• G_B ≠ X_B. When G_B happens, the remaining set is not X_A ∪ X_C.
• The ratio P[X_A ∩ G_B]/P[G_B] equals P[X_A]. This is independence.
2.5 Summary


By now, we hope that you have become familiar with our slogan probability is a measure 
of the size of a set. Let us summarize: 


• Probability = a probability law P. You can also view it as the value returned by P.


• Measure = a ruler, a scale, a stopwatch, or another measuring device. It is a tool that
tells you how large or small a set is. The measure has to be compatible with the set.
If a set is finite, then the measure can be a counter. If a set is a continuous interval,
then the measure can be the length of the interval.


• Size = the relative weight of the set for the sample space. Measuring the size is done
by using a weighting function. Think of a fair coin versus a biased coin. The former
has a uniform weight, whereas the latter has a nonuniform weight.


• Set = an event. An event is a subset in the sample space. A probability law P always
maps a set to a number. This is different from a typical function that maps a number
to another number.


If you understand what this slogan means, you will understand why probability can be 
applied to discrete events, continuous events, events in n-D spaces, etc. You will also under- 
stand the notion of measure zero and the notion of almost sure. These concepts lie at the 
foundation of modern data science, in particular, theoretical machine learning. 

The second half of this chapter discusses the concept of conditional probability. Conditional
probability is a metaconcept that can be applied to any measure you use. The
motivation of conditional probability is to restrict the probability to a subevent happening
in the sample space. If B has happened, the probability for A to also happen is P[A ∩ B]/P[B].
If two events are not influencing each other, then we say that A and B are independent.
According to Bayes’ theorem, we can also switch the order of A given B and B given A.
Finally, the law of total probability gives us a way to decompose
events into subevents.

We end this chapter by mentioning a few terms related to conditional probabilities 
that will become useful later. Let us use the tennis tournament as an example: 


• P[W | A] = conditional probability = Given that you played with player A, what is
the probability that you will win?

• P[A] = prior probability = Without even entering the game, what is the chance that
you will face player A?


• P[A | W] = posterior probability = After you have won the game, what is the probability
that you have actually played with A?


In many practical engineering problems, the question of interest is often the last one. That 
is, supposing that you have observed something, what is the most likely cause of that event? 
For example, supposing we have observed this particular dataset, what is the best Gaussian 
model that would fit the dataset? Questions like these require some analysis of conditional 
probability, prior probability, and posterior probability. 
2.7 Problems


Exercise 1. 
A space S and three of its subsets are given by S = {1, 3, 5, 7, 9, 11}, A = {1, 3, 5}, B =
{7, 9, 11}, and C = {1, 3, 9, 11}. Find A ∩ B ∩ C, A^c ∩ B, A − C, and (A − B) ∪ B.


Exercise 2. 
Let A = (−∞, r] and B = (−∞, s] where r ≤ s. Find an expression for C = (r, s] in terms
of A and B. Show that B = A ∪ C, and A ∩ C = ∅.


Exercise 3. (VIDEO SOLUTION) 
Simplify the following sets. 


(a) [1, 4] ∩ ([0, 2] ∪ [3, 5])
(b) ([0, 1] ∪ [2, 3])^c

(c) ∩_{n=1}^{∞} (−1/n, +1/n)

(d) ∪_{n=1}^{∞} [5, 8 − (2n)^{-1}]


Exercise 4. 
We will sometimes deal with the relationship between two sets. We say that A implies B 
when A is a subset of B (why?). Show the following results. 


(a) Show that if A implies B, and B implies C, then A implies C. 
(b) Show that if A implies B, then B^c implies A^c.


Exercise 5. 
Show that if A ∪ B = A and A ∩ B = A, then A = B.


Exercise 6. 
A space S is defined as S = {1, 3, 5, 7, 9, 11}, and three subsets as A = {1, 3, 5}, B =
{7, 9, 11}, C = {1, 3, 9, 11}. Assume that each element has probability 1/6. Find the following
probabilities:
(a) P[A]
(b) P[B]
(c) P[C]
(d) P[A ∪ B]
(e) P[A ∪ C]
(f) P[(A\C) ∪ B]
Exercise 7. (VIDEO SOLUTION)
A collection of 26 letters, a-z, is mixed in a jar. Two letters are drawn at random, one after 
the other. What is the probability of drawing a vowel (a,e,i,o,u) and a consonant in either 
order? What is the sample space? 


Exercise 8. 

Consider an experiment consisting of rolling a die twice. The outcome of this experiment is 
an ordered pair whose first element is the first value rolled and whose second element is the 
second value rolled. 


(a) Find the sample space. 


(b) Find the set A representing the event that the value on the first roll is greater than 
or equal to the value on the second roll. 


(c) Find the set B corresponding to the event that the first roll is a six. 


(d) Let C correspond to the event that the first value rolled and the second value rolled
differ by two. Find A ∩ C.


Note that A, B, and C should be subsets of the sample space specified in Part (a). 


Exercise 9. 
A pair of dice are rolled. 


(a) Find the sample space Ω.


(b) Find the probabilities of the events: (i) the sum is even, (ii) the first roll is equal to 
the second, (iii) the first roll is larger than the second. 


Exercise 10. 
Let A, B and C be events in an event space. Find expressions for the following: 


(a) Exactly one of the three events occurs.

(b) Exactly two of the events occur.

(c) Two or more of the events occur.

(d) None of the events occur.


Exercise 11. 

A system is composed of five components, each of which is either working or failed. Consider
an experiment that consists of observing the status of each component, and let the outcomes
of the experiment be given by all vectors (x_1, x_2, x_3, x_4, x_5), where x_i is 1 if component i is
working and 0 if component i is not working.


(a) How many outcomes are in the sample space of this experiment? 


(b) Suppose that the system will work if components 1 and 2 are both working, or if 
components 3 and 4 are both working, or if components 1, 3, and 5 are all working. 
Let W be the event that the system will work. Specify all of the outcomes in W. 
(c) Let A be the event that components 4 and 5 have both failed. How many outcomes
are in the event A?


(d) Write out all outcomes in the event A ∩ W.


Exercise 12. (VIDEO SOLUTION) 

A number x is selected at random in the interval [−1, 2]. Let the events A = {x | x < 0},
B = {x | |x − 0.5| < 0.5}, C = {x | x > 0.75}. Find (a) P[A | B], (b) P[B | C], (c) P[A | C^c],
(d) P[B | C^c].


Exercise 13. (VIDEO SOLUTION) 
Let the events A and B have P[A] = x, P[B] = y and P[A ∪ B] = z. Find the following
probabilities: (a) P[A ∩ B], (b) P[A^c ∩ B^c], (c) P[A^c ∪ B^c], (d) P[A ∩ B^c], (e) P[A^c ∪ B].


Exercise 14. 


(a) By using the fact that P[A ∪ B] ≤ P[A] + P[B], show that P[A ∪ B ∪ C] ≤ P[A] + P[B] +
P[C].


(b) By using the fact that P[∪_{k=1}^{n} A_k] ≤ Σ_{k=1}^{n} P[A_k], show that

P[∩_{k=1}^{n} A_k] ≥ 1 − Σ_{k=1}^{n} P[A_k^c].


Exercise 15. 
Use the distributive property of set operations to prove the following generalized distributive
law:

A ∪ (∩_{i=1}^{n} B_i) = ∩_{i=1}^{n} (A ∪ B_i).

Hint: Use mathematical induction. That is, show that the above is true for n = 2 and that
it is also true for n = k + 1 when it is true for n = k.


Exercise 16. 
The following result is known as Bonferroni’s inequality.


(a) Prove that for any two events A and B, we have


P(A ∩ B) ≥ P(A) + P(B) − 1.


(b) Generalize the above to the case of n events A_1, A_2, ..., A_n, by showing that
P(A_1 ∩ A_2 ∩ ··· ∩ A_n) ≥ P(A_1) + P(A_2) + ··· + P(A_n) − (n − 1).


Hint: You may use the generalized Union Bound P(∪_{i=1}^{n} A_i) ≤ Σ_{i=1}^{n} P(A_i).


Exercise 17. (VIDEO SOLUTION) 
Let A, B, C be events with probabilities P[A] = 0.5, P[B] = 0.2, P[C] = 0.4. Find

(a) P[A ∪ B] if A and B are independent.

(b) P[A ∪ B] if A and B are disjoint.

(c) P[A ∪ B ∪ C] if A, B and C are independent.

(d) P[A ∪ B ∪ C] if A, B and C are pairwise disjoint; can this happen?


Exercise 18. (VIDEO SOLUTION) 

A block of information is transmitted repeatedly over a noisy channel until an error-free block
is received. Let M ≥ 1 be the number of blocks required for a transmission. Define the
following sets.


(i) A = {M is even} 
(ii) B = {M is a multiple of 3} 
(iii) C = {M is less than or equal to 6} 


Assume that the probability of requiring one additional block is half of the probability
without the additional block. That is:

P[M = k] = (1/2)^k, k = 1, 2, ....


Determine the following probabilities. 


(a) P[A], P[B], P[C], P[C^c]

(b) P[A ∩ B], P[A\B], P[A ∩ B ∩ C]
(c) P[A | B], P[B | A]

(d) P[A | B ∩ C], P[A ∩ B | C]


Exercise 19. (VIDEO SOLUTION) 

A binary communication system transmits a signal X that is either a +2-voltage signal or 
a —2-voltage signal. A malicious channel reduces the magnitude of the received signal by 
the number of heads it counts in two tosses of a coin. Let Y be the resulting signal. Possible 
values of Y are listed below. 


          2 Heads    1 Head     No Head
X = −2    Y = 0      Y = −1     Y = −2
X = +2    Y = 0      Y = +1     Y = +2


Assume that the probability of having X = +2 and X = —2 is equal. 
(a) Find the sample space of Y, and hence the probability of each value of Y. 
(b) What are the probabilities P[X = +2 | Y = 1] and P[Y = 1 | X = −2]?


Exercise 20. (VIDEO SOLUTION) 
A block of 100 bits is transmitted over a binary communication channel with a probability
of bit error p = 10^{-2}.
(a) If the block has 1 or fewer errors, then the receiver accepts the block. Find the probability
that the block is accepted.


(b) If the block has more than 1 error, then the block is retransmitted. What is the 
probability that 4 blocks are transmitted? 


Exercise 21. (VIDEO SOLUTION) 
A machine makes errors in a certain operation with probability p. There are two types of
errors. The fraction of errors that are type A is a and the fraction that are type B is 1 − a.

(a) What is the probability of k errors in n operations?
(b) What is the probability of k_1 type A errors in n operations?
(c) What is the probability of k_2 type B errors in n operations?
(d) What is the joint probability of k_1 type A errors and k_2 type B errors in n operations?
Hint: There are (n choose k_1)(n − k_1 choose k_2) possibilities of having k_1 type A errors and k_2 type B errors
in n operations. (Why?)


Exercise 22. (VIDEO SOLUTION) 

A computer manufacturer uses chips from three sources. Chips from sources A, B and C 
are defective with probabilities 0.005, 0.001 and 0.01, respectively. The proportions of chips 
from A, B and C are 0.5, 0.1 and 0.4 respectively. If a randomly selected chip is found to 
be defective, find 


(a) the probability that the chips are from A. 
(b) the probability that the chips are from B. 


(c) the probability that the chips are from C. 


Exercise 23. (VIDEO SOLUTION) 

In a lot of 100 items, 50 items are defective. Suppose that m items are selected for testing.
We say that the manufacturing process is malfunctioning if one or more of the tested
items are found to be defective. Call the probability of this event the failure probability p.
What should be the minimum m such that p > 0.99?


Exercise 24. (VIDEO SOLUTION) 
One of two coins is selected at random and tossed three times. The first coin comes up heads
with probability p_1 = 1/3 and the second coin with probability p_2 = 2/3.


(a) What is the probability that the number of heads is k = 3? 
(b) Repeat (a) for k = 0,1, 2. 


(c) Find the probability that coin 1 was tossed given that k heads were observed, for 
k = 0,1, 2,3. 


(d) In part (c), which coin is more probable when 2 heads have been observed?
Exercise 25. (VIDEO SOLUTION)

Consider the following communication channel. A source transmits a string of binary symbols
through a noisy communication channel. Each symbol is 0 or 1 with probability p and
1 − p, respectively, and is received incorrectly with probability ε_0 and ε_1, respectively. Errors in different
symbol transmissions are independent.

[Figure: binary channel diagram with error probabilities ε_0 and ε_1.]


Denote S as the source and R as the receiver. 
(a) What is the probability that a symbol is correctly received? Hint: Find

P[R = 1 ∩ S = 1] and P[R = 0 ∩ S = 0].


(b) Find the probability of receiving 1011 conditioned on that 1011 was sent, i.e., 


P[R = 1011|S = 1011]. 


(c) To improve reliability, each symbol is transmitted three times, and the received
string is decoded by the majority rule. In other words, a 0 (or 1) is transmitted as
000 (or 111, respectively), and it is decoded at the receiver as a 0 (or 1) if and only if
the received three-symbol string contains at least two 0s (or 1s, respectively). What
is the probability that the symbol is correctly decoded, given that we send a 0?

(d) Suppose that the scheme of part (c) is used. What is the probability that a 0 was
sent if the string 101 was received?

(e) Suppose the scheme of part (c) is used and given that a 0 was sent. For what value of
ε_0 is there an improvement in the probability of correct decoding? Assume that
ε_0 ≠ 0.
Discrete Random Variables


When working on a data analysis problem, one of the biggest challenges is the disparity 
between the theoretical tools we learn in school and the actual data our boss hands to us. 
By actual data, we mean a collection of numbers, perhaps organized or perhaps not. When 
we are given the dataset, the first thing we do would certainly not be to define the Borel 
σ-field and then define the measure. Instead, we would normally compute the mean, the
standard deviation, and perhaps some scores about the skewness. 

The situation is best explained by the landscape shown in Figure 3.1. On the one hand, 
we have well-defined probability tools, but on the other hand, we have a set of practical 
“battle skills” for processing data. Often we view them as two separate entities. As long as 
we can pull the statistics from the dataset, why bother about the theory? Alternatively, we 
have a set of theories, but we will never verify them using the actual datasets. How can we 
bridge the two? What are the missing steps in the probability theory we have learned so 
far? The goal of this chapter (and the next) is to fill this gap. 




Figure 3.1: The landscape of probability and data. Often we view probability and data analysis as two 
different entities. However, probability and data analysis are inseparable. The goal of this chapter is to 
link the two. 


Three concepts to bridge the gap between theory and practice 


The starting point of our discussion is a probability space (Ω, F, P). It is an abstract concept,
but we hope we have convinced you in Chapter 2 of its significance. However, the probability
space is certainly not “user friendly” because no one would write a Python program to
implement those theories. How do we make the abstract probability space more convenient
so that we can model practical scenarios?

The first step is to recognize that the sample space and the event space are all based 
on statements, for example, “getting a head when flipping a coin” or “winning the game.” 
These statements are not numbers, but we (engineers) love numbers. Therefore, we should 
ask a very basic question: How do we convert a statement to a number? The answer is the 
concept of random variables. 


Key Concept 1: What are random variables? 


Random variables are mappings from events to numbers. 


Now, suppose that we have constructed a random variable that translates statements to 
numbers. The next task is to endow the random variable with probabilities. More precisely, 
we need to assign probabilities to the random variable so that we can perform computations. 
This is done using the concept called probability mass function (PMF). 


Key Concept 2: What are probability mass functions (PMFs)? 


Probability mass functions are the ideal histograms of random variables. 


The best way to think about a PMF is a histogram, something we are familiar with. 
A histogram has two axes: The x-axis denotes the set of states and the y-axis denotes
the probability. For each of the states that the random variable possesses, the histogram 
tells us the probability of getting a particular state. The PMF is the ideal histogram of a 
random variable. It provides a complete characterization of the random variable. If you have 
a random variable, you must specify its PMF. Vice versa, if you tell us the PMF, you have 
specified a random variable. 
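The phrase "ideal histogram" can be made concrete with a quick experiment. The following Python sketch is our own illustration and rests on an assumption not made in the text: the random variable is the face of a fair six-sided die, so the PMF puts mass 1/6 on each state. The helper name `empirical_histogram` and the sample size are arbitrary.

```python
import random
from collections import Counter


def empirical_histogram(samples):
    """Normalized histogram: fraction of samples landing on each state."""
    counts = Counter(samples)
    n = len(samples)
    return {state: counts[state] / n for state in sorted(counts)}


# Assumed PMF of a fair die: the "ideal histogram".
pmf = {k: 1 / 6 for k in range(1, 7)}

# A finite dataset drawn from that PMF (seeded for reproducibility).
rng = random.Random(0)
samples = [rng.randint(1, 6) for _ in range(60_000)]
hist = empirical_histogram(samples)

for state in pmf:
    print(state, round(pmf[state], 4), round(hist.get(state, 0.0), 4))
```

The histogram fluctuates from dataset to dataset, while the PMF does not; with more samples the bars settle closer to the PMF values.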

We ask the third question about pulling information from the probability mass func- 
tion, such as the mean and standard deviation. How do we obtain these numbers from the 
PMF? We are also interested in operations on the mean and standard deviations. For ex- 
ample, if a professor offers ten bonus points to the entire class, how will it affect the mean 
and standard deviation? If a store provides 20% off on all its products, what will happen to 
its mean retail price and standard deviation? However, the biggest question is perhaps the 
difference between the mean we obtain from a PMF and the mean we obtain from a his- 
togram. Understanding this difference will immediately help us build a bridge from theory 
to practice. 


Key Concept 3: What is expectation? 


Expectation = Mean = Average computed from a PMF. 


Organization of this chapter 


The plan for this chapter is as follows. We will start with the basic concepts of random 
variables in Section 3.1. We will formally define the random variables and discuss their 
relationship with the abstract probability space. Once this linkage is built, we can put 
the abstract probability space aside and focus on the random variables. In Section 3.2
we will define the probability mass function (PMF) of a random variable, which tells us 
the probability of obtaining a state of the random variable. PMF is closely related to the 
histogram of a dataset. We will explain the connection. In Section 3.3 we take a small detour 
to consider the cumulative distribution functions (CDF). Then, we discuss the mean and 
standard deviation in Section 3.4. Section 3.5 details a few commonly used random variables, 
including Bernoulli, binomial, geometric, and Poisson variables. 


3.1 Random Variables 


3.1.1 A motivating example 


Consider an experiment with 4 outcomes Ω = {♣, ♦, ♥, ♠}. We want to construct the
probability space (Ω, F, P). The sample space Ω is already defined. The event space F is the
set of all possible subsets in Ω, which, in our case, is a set of 2^4 subsets. For the probability
law P, let us assume that the probability of obtaining each outcome is


PUSH =4, PKOH=2, PEON =3, Playl=t. 


Therefore, we have constructed a probability space (Ω, F, P) where everything is perfectly
defined. So, in principle, they can live together happily forever.

A lazy data scientist comes, and there is a (small) problem. The data scientist does not
want to write the symbols ♣, ♦, ♥, ♠. There is nothing wrong with his motivation because
all of us want efficiency. How can we help him? Well, the easiest solution is to encode each
symbol with a number, for example, ♣ ← 1, ♦ ← 2, ♥ ← 3, ♠ ← 4, where the arrow means
that we assign a number to the symbol. But we can express this more formally by defining
a function X : Ω → R with

X(♣) = 1, X(♦) = 2, X(♥) = 3, X(♠) = 4.

There is nothing new here: we have merely converted the symbols to numbers, with the help
of a function X. However, with X defined, the probabilities can be written as

P[X = 1] = P[{♣}], P[X = 2] = P[{♦}], P[X = 3] = P[{♥}], P[X = 4] = P[{♠}].

This is much more convenient, and so the data scientist is happy.


3.1.2 Definition of a random variable 


The story above is exactly the motivation for random variables. Let us define a random 
variable formally. 


Definition 3.1. A random variable X is a function X : Ω → R that maps an outcome
ξ ∈ Ω to a number X(ξ) on the real line.
This definition may be puzzling at first glance. Why should we overcomplicate things by
defining a function and calling it a variable? 

If you recall the story above, we can map the notations of the story to the notations 
of the definition as follows. 


Symbol   Meaning
Ω        sample space = the set containing ♣, ♦, ♥, ♠
ξ        an element in the sample space, which is one of ♣, ♦, ♥, ♠
X        a function that maps ♣ to the number 1, ♦ to the number 2, etc.
X(ξ)     a number on the real line, e.g., X(♣) = 1


This explains our informal definition of random variables: 


Key Concept 1: What are random variables? 


Random variables are mappings from events to numbers. 


The random variable X is a function. The input to the function is an outcome of the sample 
space, whereas the output is a number on the real line. This type of function is somewhat 
different from an ordinary function that often translates a number to another number. 
Nevertheless, X is a function. 




Figure 3.2: A random variable is a mapping from the outcomes in the sample space to numbers on the 
real line. We can think of a random variable X as a translator that translates a statement to a number. 


Why do we call this function X a variable? X is a variable because X has multiple 
states. As we illustrate in Figure 3.2, the mapping X translates every outcome ξ to a
number. There are multiple numbers, which are the states of X. Each state has a certain 
probability for X to land on. Because X is not deterministic, we call it a random variable. 


Example 3.1. Suppose we flip a fair coin so that Ω = {head, tail}. We can define the
random variable X : Ω → R as

X(head) = 1, and X(tail) = 0.

Therefore, when we write P[X = 1] we actually mean P[{head}]. Is there any difference
between P[{head}] and P[X = 1]? No, because they are describing two identical events.
Note that the assignment of the value is totally up to you. You can say “head” is equal
to the value 102. This is allowed and legitimate, but it isn’t very convenient.


Example 3.2. Flip a coin 2 times. The sample space Ω is

Ω = {(head, head), (head, tail), (tail, head), (tail, tail)}.

Suppose that X is a random variable that maps an outcome to a number representing
the number of heads, i.e.,

X(·) = number of heads.

Then, for the 4 ξ’s in the sample space there are only 3 distinct numbers. More precisely,
if we let ξ_1 = (head, head), ξ_2 = (head, tail), ξ_3 = (tail, head), ξ_4 = (tail, tail), then
we have

X(ξ_1) = 2, X(ξ_2) = 1, X(ξ_3) = 1, X(ξ_4) = 0.

A pictorial illustration of this random variable is shown in Figure 3.3. This example
shows that the mapping defined by the random variable is not necessarily a one-to-one
mapping because multiple outcomes can be mapped to the same number.


Figure 3.3: A random variable that maps a pair of coins to a number, where the number represents the
number of heads.
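Since a random variable is just a function on the sample space, Example 3.2 translates almost word for word into code. The sketch below is our own illustration; the names `omega`, `X`, and `states` are choices made here, not notation from the text.

```python
from itertools import product

# Sample space of two coin flips: all ordered pairs of "head"/"tail".
omega = list(product(["head", "tail"], repeat=2))


def X(outcome):
    """Random variable of Example 3.2: the number of heads in the outcome."""
    return sum(1 for flip in outcome if flip == "head")


# The set of states X(Omega). It is smaller than Omega because X is not one-to-one.
states = sorted({X(xi) for xi in omega})

for xi in omega:
    print(xi, "->", X(xi))
print("states:", states)
```

Four outcomes collapse onto three states because (head, tail) and (tail, head) are both mapped to 1.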


3.1.3 Probability measure on random variables


By now, we hope that you understand Key Concept 1: A random variable is a mapping
from a statement to a number. However, we are now facing another difficulty. We knew
how to measure the size of an event using the probability law P because P(·) takes an event
E ∈ F and sends it to a number in [0, 1]. After the translation X, we cannot send the
output X(ξ) to P(·) because P(·) “eats” a set E ∈ F and not a number X(ξ) ∈ R. Therefore,
when we write P[X = 1], how do we measure the size of the event X = 1?
This question appears difficult but is actually quite easy to answer. Since the probability
law P(·) is always applied to an event, we need to define an event for the random
variable X. If we write the sets clearly, we note that “X = a” is equivalent to the set

E = {ξ ∈ Ω | X(ξ) = a}.

This is the set that contains all possible ξ’s such that X(ξ) = a. Therefore, when we say
“find the probability of X = a,” we are effectively asking the size of the set
E = {ξ ∈ Ω | X(ξ) = a}.

How then do we measure the size of E? Since E is a subset in the sample space, E is
measurable by P. All we need to do is to determine what E is for a given a. This, in turn,
requires us to find the pre-image X^{-1}(a), which is defined as

X^{-1}(a) ≜ {ξ ∈ Ω | X(ξ) = a}.

Wait a minute, is this set just equal to E? Yes, the event E we are seeking is exactly the
pre-image X^{-1}(a). As such, the probability measure of E is

P[X = a] = P[X^{-1}(a)].


Figure 3.4 illustrates a situation where two outcomes ξ_1 and ξ_2 are mapped to the same
value a on the real line. The corresponding event is the set X^{-1}(a) = {ξ_1, ξ_2}.


Figure 3.4: When computing the probability of P[{ξ ∈ Ω | X(ξ) = a}], we effectively take the inverse
mapping X^{-1}(a) and compute the probability of the event P[{ξ ∈ X^{-1}(a)}] = P[{ξ_1, ξ_2}].


Example 3.3. Suppose we throw a fair die. The sample space is


Ω = {1, 2, 3, 4, 5, 6}.


There is a natural mapping X that maps X(1) = 1, X(2) = 2 and so on. Thus,

P[X ≤ 3] (a)= P[X = 1] + P[X = 2] + P[X = 3]
         (b)= P[X^{-1}(1)] + P[X^{-1}(2)] + P[X^{-1}(3)]
         (c)= P[{1}] + P[{2}] + P[{3}] = 3/6.


In this derivation, step (a) is based on Axiom III, where the three events are disjoint. 
Step (b) is the pre-image due to the random variable X. Step (c) is the list of ac- 
tual events in the event space. Note that there is no hand-waving argument in this 
derivation. Every step is justified by the concepts and theorems we have learned so 
far. 


Example 3.4. Throw a fair die twice. The sample space is then

Ω = {(1, 1), (1, 2), ..., (6, 6)}.

These elements can be translated to 36 outcomes:

ξ_1 = (1, 1), ξ_2 = (1, 2), ..., ξ_36 = (6, 6).

Let
X = sum of two numbers.

Then, if we want to find the probability of getting X = 7, we can trace back and ask:
Among the 36 outcomes, which of those ξ_i’s will give us X(ξ) = 7? Or, what is the set
X^{-1}(7)? To this end, we can write

P[X = 7] = P[{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}]
         = P[(1, 6)] + P[(2, 5)] + P[(3, 4)] + P[(4, 3)] + P[(5, 2)] + P[(6, 1)]
         = 6 × (1/36) = 1/6.

Again, in this example, you can see that all the steps are fully justified by the concepts 
we have learned so far. 
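
The pre-image argument of Example 3.4 can also be checked by brute-force enumeration. The following is a minimal Python sketch, assuming that the 36 outcomes are equally likely. The helper names `preimage` and `prob` are ours and are introduced only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs of die faces (assumed equally likely).
omega = list(product(range(1, 7), repeat=2))

def X(xi):
    """Random variable: the sum of the two faces."""
    return xi[0] + xi[1]

def preimage(a):
    """Return the event X^{-1}(a) = {xi in omega : X(xi) = a}."""
    return [xi for xi in omega if X(xi) == a]

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

event_7 = preimage(7)
print(event_7)        # the six outcomes that sum to 7
print(prob(event_7))  # probability of the event X = 7
```

Here the probability of $X = 7$ is obtained by measuring the event $X^{-1}(7)$ in the sample space, which is the same route taken in the derivation above.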


Closing remark. In practice, when the problem is clearly defined, we can skip the inverse
mapping $X^{-1}(a)$. However, this does not mean that the probability triplet $(\Omega, \mathcal{F}, \mathbb{P})$ is gone;
it is still present. The triplet is now just the background of the problem.

The set of all possible values returned by $X$ is denoted as $X(\Omega)$. Since $X$ is not
necessarily a bijection, the size of $X(\Omega)$ is not necessarily the same as the size of $\Omega$. The
elements in $X(\Omega)$ are often denoted as $a$ or $x$. We call $a$ or $x$ one of the states of $X$. Be
careful not to confuse $x$ and $X$. The variable $X$ is the random variable; it is a function.
The variable $x$ is a state assigned by $X$. A random variable $X$ has multiple states. When
we write $\mathbb{P}[X = x]$, we describe the probability of a random variable $X$ taking a particular
state $x$. It is exactly the same as $\mathbb{P}[\{\xi \in \Omega \mid X(\xi) = x\}]$.


3.2 Probability Mass Function


Random variables are mappings that translate events to numbers. After the translation, 
we have a set of numbers denoting the states of the random variables. Each state has a 
different probability of occurring. The probabilities are summarized by a function known as 
the probability mass function (PMF). 


3.2.1 Definition of probability mass function 


Definition 3.2. The probability mass function (PMF) of a random variable $X$ is a
function which specifies the probability of obtaining a number $X(\xi) = x$. We denote a
PMF as

$$p_X(x) = \mathbb{P}[X = x]. \qquad (3.1)$$

The set of all possible states of $X$ is denoted as $X(\Omega)$.


Do not get confused by the sample space $\Omega$ and the set of states $X(\Omega)$. The sample space $\Omega$
contains all the possible outcomes of the experiments, whereas $X(\Omega)$ is the translation by
the mapping $X$. The event $X = a$ is the set $X^{-1}(a) \subseteq \Omega$. Therefore, when we say $\mathbb{P}[X = a]$
we really mean $\mathbb{P}[X^{-1}(a)]$.

The probability mass function is a histogram summarizing the probability of each of 
the states X takes. Since it is a histogram, a PMF can be easily drawn as a bar chart. 


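Example 3.5. Flip a coin twice. The sample space is $\Omega = \{HH, HT, TH, TT\}$. We
can assign a random variable $X$ = number of heads. Therefore,

$$X(\text{“HH”}) = 2, \quad X(\text{“TH”}) = 1, \quad X(\text{“HT”}) = 1, \quad X(\text{“TT”}) = 0.$$

So the random variable $X$ takes three states: 0, 1, 2. The PMF is therefore

$$\begin{aligned}
p_X(0) &= \mathbb{P}[\{\text{“TT”}\}] = \frac{1}{4}, \\
p_X(1) &= \mathbb{P}[\{\text{“TH”}, \text{“HT”}\}] = \frac{1}{2}, \\
p_X(2) &= \mathbb{P}[\{\text{“HH”}\}] = \frac{1}{4}.
\end{aligned}$$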
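
To make the construction of a PMF concrete, here is a small Python sketch that builds the PMF of Example 3.5 by accumulating the probability of every outcome onto the state it is mapped to. It assumes a fair coin, so that each of the four outcomes has probability 1/4. The dictionary names `P` and `pmf` are illustrative choices, not notation from the text.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space of two coin flips: HH, HT, TH, TT (assumed equally likely).
omega = ["".join(flips) for flips in product("HT", repeat=2)]
P = {xi: Fraction(1, len(omega)) for xi in omega}

def X(xi):
    """Random variable: number of heads in the outcome."""
    return xi.count("H")

# Accumulate the mass of each outcome onto the state X(xi).
pmf = defaultdict(Fraction)
for xi in omega:
    pmf[X(xi)] += P[xi]
pmf = dict(pmf)

print(pmf)  # probability mass of the states 0, 1, 2
```

The loop is the computational counterpart of $p_X(x) = \mathbb{P}[X^{-1}(x)]$: every outcome in the pre-image of $x$ contributes its mass to $p_X(x)$.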


3.2.2 PMF and probability measure


In Chapter 2, we learned that probability is a measure of the size of a set. We introduced a
weighting function that weights each of the elements in the set. The PMF is the weighting
function for discrete random variables. Two random variables are different when their PMFs
are different because they are constructing two different measures.

To illustrate the idea, suppose there are two dice. They each have probability masses
as follows.

Die 1:
$$\mathbb{P}[\{1\}] = \frac{1}{12}, \; \mathbb{P}[\{2\}] = \frac{2}{12}, \; \mathbb{P}[\{3\}] = \frac{3}{12}, \; \mathbb{P}[\{4\}] = \frac{4}{12}, \; \mathbb{P}[\{5\}] = \frac{1}{12}, \; \mathbb{P}[\{6\}] = \frac{1}{12}.$$

Die 2:
$$\mathbb{P}[\{1\}] = \frac{2}{12}, \; \mathbb{P}[\{2\}] = \frac{2}{12}, \; \mathbb{P}[\{3\}] = \frac{2}{12}, \; \mathbb{P}[\{4\}] = \frac{2}{12}, \; \mathbb{P}[\{5\}] = \frac{2}{12}, \; \mathbb{P}[\{6\}] = \frac{2}{12}.$$

Let us define two random variables, $X$ and $Y$, for the two dice. Then, the PMFs $p_X$ and $p_Y$
can be defined as

$$p_X(1) = \frac{1}{12}, \; p_X(2) = \frac{2}{12}, \; p_X(3) = \frac{3}{12}, \; p_X(4) = \frac{4}{12}, \; p_X(5) = \frac{1}{12}, \; p_X(6) = \frac{1}{12},$$
$$p_Y(1) = \frac{2}{12}, \; p_Y(2) = \frac{2}{12}, \; p_Y(3) = \frac{2}{12}, \; p_Y(4) = \frac{2}{12}, \; p_Y(5) = \frac{2}{12}, \; p_Y(6) = \frac{2}{12}.$$

These two probability mass functions correspond to two different probability measures, let's
say $F$ and $G$. Define the event $E$ = {between 2 and 3}. Then, $F(E)$ and $G(E)$ will lead to
two different results:

$$\begin{aligned}
F(E) &= \mathbb{P}[2 \le X \le 3] = p_X(2) + p_X(3) = \frac{2}{12} + \frac{3}{12} = \frac{5}{12}, \\
G(E) &= \mathbb{P}[2 \le Y \le 3] = p_Y(2) + p_Y(3) = \frac{2}{12} + \frac{2}{12} = \frac{4}{12}.
\end{aligned}$$

Note that even though for some particular events the two final results could be the same (e.g.,
$1 \le X \le 3$ and $1 \le Y \le 3$, which both have probability $6/12$), the underlying measures are
completely different.

Figure 3.5 shows another example of two different measures $F$ and $G$ on the same
sample space $\Omega = \{♣, ♦, ♥, ♠\}$. Since the PMFs of the two measures are different, even
when given the same event $E$, the resulting probabilities will be different.


Figure 3.5: If we want to measure the size of a set $E$, using two different PMFs is equivalent to using
two different measures. Therefore, the probabilities will be different.
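
The idea that a PMF acts as a weighting function can be sketched in a few lines of Python. The masses below are the ones we read off the two-dice illustration (1, 2, 3, 4, 1, 1 twelfths for the first die and 2 twelfths per face for the second); treat them as assumptions of this sketch. The function name `measure` is ours.

```python
from fractions import Fraction

states = [1, 2, 3, 4, 5, 6]
# Assumed masses for the two dice, in units of 1/12.
p_X = dict(zip(states, [Fraction(n, 12) for n in (1, 2, 3, 4, 1, 1)]))
p_Y = dict(zip(states, [Fraction(2, 12)] * 6))

def measure(pmf, event):
    """Size of an event under the measure induced by a PMF."""
    return sum(pmf[x] for x in event)

E = {2, 3}
print(measure(p_X, E), measure(p_Y, E))  # same event, two different sizes
```

The same event is weighted differently by the two PMFs, although particular events (for instance {1, 2, 3} with these masses) can still receive equal size under both.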


Does $p_X = p_Y$ imply $X = Y$? If two random variables $X$ and $Y$ have the same PMF,
does it mean that the random variables are the same? The answer is no. Consider a random
variable with a symmetric PMF, e.g.,

$$p_X(-1) = \frac{1}{4}, \quad p_X(0) = \frac{1}{2}, \quad p_X(1) = \frac{1}{4}.$$

Suppose $Y = -X$. Then, $p_Y(-1) = \frac{1}{4}$, $p_Y(0) = \frac{1}{2}$, and $p_Y(1) = \frac{1}{4}$, which is the same as $p_X$.
However, $X$ and $Y$ are two different random variables. If the sample space is $\{♣, ♦, ♥\}$, we
can define the mappings $X(\cdot)$ and $Y(\cdot)$ as

$$\begin{aligned}
X(♣) &= -1, & X(♦) &= 0, & X(♥) &= +1, \\
Y(♣) &= +1, & Y(♦) &= 0, & Y(♥) &= -1.
\end{aligned} \qquad (3.2)$$

Therefore, when we say $p_X(-1) = \frac{1}{4}$, the underlying event is ♣. But when we say $p_Y(-1) = \frac{1}{4}$,
the underlying event is ♥. The two random variables are different, although their PMFs have
exactly the same shape.
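
A short Python sketch can make the distinction between a random variable (a mapping) and its PMF explicit. We assume the three outcomes carry masses 1/4, 1/2, 1/4, which is what the symmetric PMF above requires. The outcome labels and the helper `pmf` are illustrative names only.

```python
from fractions import Fraction

# Assumed probability of each outcome in the sample space.
P = {"club": Fraction(1, 4), "diamond": Fraction(1, 2), "heart": Fraction(1, 4)}

# Two different mappings from outcomes to numbers, with Y = -X.
X = {"club": -1, "diamond": 0, "heart": +1}
Y = {xi: -X[xi] for xi in X}

def pmf(rv):
    """Collect the mass of every outcome onto the state it maps to."""
    out = {}
    for xi, value in rv.items():
        out[value] = out.get(value, 0) + P[xi]
    return out

print(pmf(X) == pmf(Y))  # the two PMFs coincide
print(X == Y)            # the two mappings do not
```

The comparison shows that equality of PMFs is a statement about the induced measures, not about the underlying functions.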


3.2.3 Normalization property


Here we must mention one important property of a probability mass function. This property 
is known as the normalization property, which is a useful tool for a sanity check. 


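Theorem 3.1. A PMF should satisfy the condition that

$$\sum_{x \in X(\Omega)} p_X(x) = 1.$$


Proof. The proof follows directly from Axiom II, which states that $\mathbb{P}[\Omega] = 1$. Since $x$ covers
all numerical values $X$ can take, and since the events for distinct $x$ are disjoint, by Axiom III we have

$$\begin{aligned}
\sum_{x \in X(\Omega)} \mathbb{P}[X = x] &= \sum_{x \in X(\Omega)} \mathbb{P}[\{\xi \in \Omega \mid X(\xi) = x\}] \\
&= \mathbb{P}\left[\bigcup_{x \in X(\Omega)} \{\xi \in \Omega \mid X(\xi) = x\}\right] = \mathbb{P}[\Omega] = 1.
\end{aligned}$$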


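Practice Exercise 3.1. Let $p_X(k) = c\left(\frac{1}{2}\right)^k$, where $k = 1, 2, \ldots$. Find $c$.


Solution. Since $\sum_{k \in X(\Omega)} p_X(k) = 1$, we must have

$$\sum_{k=1}^{\infty} c\left(\frac{1}{2}\right)^k = 1.$$

Evaluating the geometric series, we can show that

$$\sum_{k=1}^{\infty} c\left(\frac{1}{2}\right)^k = c \cdot \frac{\frac{1}{2}}{1 - \frac{1}{2}} = c \quad \Longrightarrow \quad c = 1.$$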


Practice Exercise 3.2. Let $p_X(k) = c \cdot \sin\left(\frac{\pi}{2} k\right)$, where $k = 1, 2, \ldots$. Find $c$.

Solution. The reader might be tempted to sum $p_X(k)$ over all the possible $k$'s:

$$\sum_{k=1}^{\infty} \sin\left(\frac{\pi}{2} k\right) = 1 + 0 - 1 + 0 + \cdots.$$

However, a more careful inspection reveals that $p_X(k)$ is actually negative when $k = 3, 7, 11, \ldots$.
This cannot happen because a probability mass function must be non-negative.
Therefore, the problem is not well defined, and so there is no solution.
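
As a sanity check in the spirit of the two exercises, the sketch below tests non-negativity and normalization numerically. It is only an illustration: the infinite sums are truncated at $k = 60$, the tolerance is an arbitrary choice, and `check_pmf` is a hypothetical helper name.

```python
import numpy as np

def check_pmf(values, tol=1e-9):
    """Return (non-negative?, sums to one?) for candidate PMF values."""
    values = np.asarray(values, dtype=float)
    nonneg = bool(np.all(values >= -tol))
    normalized = bool(abs(values.sum() - 1.0) < tol)
    return nonneg, normalized

k = np.arange(1, 61)                # truncation of k = 1, 2, ...
geometric = 0.5 ** k                # candidate from Exercise 3.1 with c = 1
sinusoid = np.sin(np.pi * k / 2)    # candidate from Exercise 3.2 with c = 1

print(check_pmf(geometric))
print(check_pmf(sinusoid))
```

The first candidate passes both checks up to truncation error, whereas the second already fails the non-negativity check, so no choice of $c$ can rescue it.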


Figure 3.6: (a) The PMF of $p_X(k) = \left(\frac{1}{2}\right)^k$, for $k = 1, 2, \ldots$. (b) The PMF of $p_X(k) = \sin\left(\frac{\pi}{2} k\right)$,
where $k = 1, 2, \ldots$. Note that this is not a valid PMF because probability cannot have negative values.


3.2.4 PMF versus histogram


PMFs are closely related to histograms. A histogram is a plot that shows the frequency of 
a state. As we see in Figure 3.6, the x-axis is a collection of states, whereas the y-axis is 
the frequency. So a PMF is indeed a histogram. 

Viewing a PMF as a histogram can help us understand a random variable. For better 
or worse, treating a random variable as a histogram could help you differentiate a random 
variable from a variable. An ordinary variable only has one state, but a random variable 
has multiple states. At any particular instance, we do not know which state will show up 
before our observation. However, we do know the probability. For example, in the coin-flip 
example, while we do not know whether we will get “HH,” we know that the chance of 
getting “HH” is 1/4. Of course, having a probability of 1/4 does not mean that we will get 
“HH” once every four trials. It only means that if we run an infinite number of experiments, 
then 1/4 of the experiments will give us “HH.” 

The linkage between PMF and histogram can be quite practical. For example, while 
we do not know the true underlying distribution of the 26 letters of the English alphabet, we 
can collect a large number of words and plot the histogram. The example below illustrates 
how we can empirically define a random variable from the data. 


Example. There are 26 English letters, but the frequencies of the letters in writing are 
different. If we define a random variable X as a letter we randomly draw from an English 
text, we can think of X as an object with 26 different states. The mapping associated with the 
random variable is straightforward: X (“a”) = 1, X(“b”) = 2, etc. The probability of landing 
on a particular state approximately follows a histogram shown in Figure 3.7. The histogram 
provides meaningful values of the probabilities, e.g., px (1) = 0.0847, px (2) = 0.0149, etc. 
The true probability of the states may not be exactly these values. However, when we have 
enough samples, we generally expect the histogram to approach the theoretical PMF. The 
MATLAB and Python codes used to generate this histogram are shown below. 


% MATLAB code to generate the histogram
load('ch3_data_English');
bar(f/100, 'FaceColor', [0.9,0.6,0.0]);
xticklabels({'a','b','c','d','e','f','g','h','i','j','k','l',...
    'm','n','o','p','q','r','s','t','u','v','w','x','y','z'});
xticks(1:26);
yticks(0:0.02:0.2);
axis([1 26 0 0.13]);


# Python code to generate the histogram
import numpy as np
import matplotlib.pyplot as plt
f = np.loadtxt('./ch3_data_english.txt')
n = np.arange(26)
plt.bar(n, f/100)
ntag = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
        'n','o','p','q','r','s','t','u','v','w','x','y','z']
plt.xticks(n, ntag)


Figure 3.7: The frequency of the 26 English letters. Data source: Wikipedia.

PMF = ideal histograms 


If a random variable is more or less a histogram, why is the PMF such an important concept? 
The answer to this question has two parts. The first part is that the histogram generated 
from a dataset is always an empirical histogram, so-called because the dataset comes from 
observation or experience rather than theory. Thus the histograms may vary slightly every 
time we collect a dataset. 

As we increase the number of data points in a dataset, the histogram will eventually 
converge to an ideal histogram, or a distribution. For example, counting the number of 
heads in 100 coin flips will fluctuate more in percentage terms than counting the heads in 10 
million coin flips. The latter will almost certainly have a histogram that is closer to a 50-50 
distribution. Therefore, the “histogram” generated by a random variable can be considered 
the ultimate histogram or the limiting histogram of the experiment. 

To help you visualize the difference between a PMF and a histogram, we show in
Figure 3.8 an experiment in which a die is thrown $N$ times. Assuming that the die is fair,
the PMF is simply $p_X(k) = 1/6$ for $k = 1, \ldots, 6$, which is a uniform distribution across
the 6 states. Now, we can throw the die many times. As $N$ increases, we observe that the
histogram becomes more like the PMF. You can imagine that when $N$ goes to infinity, the
histogram will eventually become the PMF. Therefore, when given a dataset, one way to
think of it is to treat the data as random realizations drawn from a certain PMF. The more
data points you have, the closer the histogram will become to the PMF.


Figure 3.8: Histogram and PMF, when throwing a fair die $N$ times. (a) $N = 100$. (b) $N = 1000$.
(c) $N = 10000$. (d) PMF. As $N$ increases, the histograms become more similar to the PMF.
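
The convergence of the histogram to the PMF can also be observed numerically. The sketch below is one possible way to do so, assuming a fair die and using NumPy's `default_rng` generator with a fixed seed. The function name `empirical_pmf` and the chosen values of $N$ are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
pmf = np.full(6, 1 / 6)   # ideal histogram of a fair die

def empirical_pmf(n_throws):
    """Normalized histogram of n_throws simulated throws of a fair die."""
    throws = rng.integers(1, 7, size=n_throws)      # integers 1, ..., 6
    counts = np.bincount(throws, minlength=7)[1:]   # drop the unused bin 0
    return counts / n_throws

for n in (100, 1000, 10000, 1000000):
    gap = np.max(np.abs(empirical_pmf(n) - pmf))
    print(n, gap)   # largest deviation from 1/6 over the six states
```

The printed gap typically shrinks as the number of throws grows, although for any finite $N$ it remains a random quantity.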

The MATLAB and Python codes used to generate Figure 3.8 are shown below. The
two commands we use here are randi (in MATLAB), which generates random integers,
and hist, which computes the heights and bin centers of a histogram. In Python,
the corresponding commands are np.random.randint and plt.hist. Note that
np.random.randint excludes the upper end of its range, so we draw from the range 1 to 7
to obtain the integers 1,...,6. Also, we shift the x-axes so that the bars are centered
at the integers; in Python, we place the bin edges at the half-integers for the same purpose.


% MATLAB code to generate the histogram
x = [1 2 3 4 5 6];
q = randi(6,100,1);
figure;
[num,val] = hist(q,x-0.5);
bar(num/100,'FaceColor',[0.8, 0.8,0.8]);
axis([0 7 0 0.24]);


# Python code to generate the histogram
import numpy as np
import matplotlib.pyplot as plt
q = np.random.randint(1, 7, size=100)    # integers 1,...,6
plt.hist(q, bins=np.arange(0.5, 7.5))    # bin edges 0.5, 1.5, ..., 6.5


This generative perspective is illustrated in Figure 3.9. We assume that the underlying 
latent random variable has some PMF that can be described by a few parameters, e.g., the 
mean and variance. Given the data points, if we can infer these parameters, we might retrieve 
the entire PMF (up to the uncertainty level intrinsic to the dataset). We refer to this inverse 
process as statistical inference. 


Figure 3.9: When analyzing a dataset, one can treat the data points as samples drawn according to a
latent random variable with a certain PMF. The dataset we observe is often finite, and so the histogram
we obtain is empirical. A major task in data analysis is statistical inference, which tries to retrieve the
model information from the available measurements.


Returning to the question of why we need to understand the PMFs, the second part 
of the answer is the difference between synthesis and analysis. In synthesis, we start with 
a known random variable and generate samples according to the PMF underlying the ran- 
dom variable. For example, on a computer, we often start with a Gaussian random variable 
and generate random numbers according to the histogram specified by the Gaussian ran- 
dom variable. Synthesis is useful because we can predict what will happen. We can, for 
example, create millions of training samples to train a deep neural network. We can also 
evaluate algorithms used to estimate statistical quantities such as mean, variance, moments, 
etc., because the synthesis approach provides us with ground truth. In supervised learning 
scenarios, synthesis is vital to ensuring sufficient training data. 

The other direction of synthesis is analysis. The goal is to start with a dataset and 
deduce the statistical properties of the dataset. For example, suppose we want to know 
whether the underlying model is indeed a Gaussian model. If we know that it is a Gaussian 
(or if we choose to use a Gaussian), we want to know the parameters that define this 
Gaussian. The analysis direction addresses this model selection and parameter estimation 
problem. Moving forward, once we know the model and the parameters, we can make a 
prediction or do recovery, both of which are ubiquitous in machine learning. 
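
The two directions can be sketched in a few lines of Python. In this illustration the Gaussian model, its parameters (mean 2.0 and standard deviation 0.5), and the sample size are hypothetical choices of ours. The point is only that synthesis starts from known parameters, whereas analysis tries to recover them from the samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthesis: start with a known model and generate samples from it.
true_mean, true_std = 2.0, 0.5   # hypothetical ground-truth parameters
samples = rng.normal(true_mean, true_std, size=100000)

# Analysis: start with the samples and estimate the model parameters.
est_mean = samples.mean()
est_std = samples.std(ddof=1)    # sample standard deviation

print(est_mean, est_std)         # estimates to compare with the ground truth
```

Because the samples were synthesized, the ground truth is available, so the quality of the estimates can be evaluated directly.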

We summarize our discussions below, which is Key Concept 2 of this chapter. 


Key Concept 2: What are probability mass functions (PMFs)? 


PMFs are the ideal histograms of random variables.


3.2.5 Estimating histograms from real data


The following discussions about histogram estimation can be skipped if it is your first
time reading the book.


If you have a dataset, how would you plot the histogram? Certainly, if you have access 
to MATLAB or Python, you can call standard functions such as hist (in MATLAB) or 
np.histogram (in Python). However, when plotting a histogram, you need to specify the 
number of bins (or equivalently the width of bins). If you use larger bins, then you will have 
fewer bins with many elements in each bin. Conversely, if the bin width is too small, you 
may not have enough samples to fill the histogram. Figure 3.10 illustrates two histograms 
in which the bins are respectively too large and too small. 


Figure 3.10: The width of the histogram has substantial influence on the information that can be
extracted from the histogram. (a) 5 bins ($K = 5$). (b) 200 bins ($K = 200$).


The MATLAB and Python codes used to generate Figure 3.10 are shown below. Note
that here we are using an exponential random variable (to be discussed in Chapter 4). In
MATLAB, calling an exponential random variable is done using exprnd, whereas in Python
the command is np.random.exponential. For this experiment, we can specify the number
of bins, which can be set to 200 or 5 (in the code, k denotes the number of samples). To
suppress the Python output of the array, we can add a semicolon ;. A final note is that
lambda is a reserved keyword in Python, so we use the name lambd instead.


% MATLAB code used to generate the plots
lambda = 1;
k = 1000;
X = exprnd(1/lambda,[k,1]);
[num,val] = hist(X,200);
bar(val,num,'FaceColor',[1, 0.5,0.5]);


# Python code used to generate the plots
import numpy as np
import matplotlib.pyplot as plt
lambd = 1
k = 1000
X = np.random.exponential(1/lambd, size=k)
plt.hist(X, bins=200);


In statistics, there are various rules to determine the bin width of a histogram. We 
mention a few of them here. Let K be the number of bins and N the number of samples. 


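• Square-root: $K = \sqrt{N}$.
• Sturges' formula: $K = \log_2 N + 1$.
• Rice Rule: $K = 2\sqrt[3]{N}$.
• Scott's normal reference rule: $K = \frac{\max X - \min X}{h}$, where $h = \frac{3.5\,\hat{\sigma}}{\sqrt[3]{N}}$ is the bin
width and $\hat{\sigma}$ is the sample standard deviation.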
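
The four rules listed above can be sketched as a small Python helper. This is an illustration under our own assumptions: the bin counts are rounded up to integers, Scott's rule uses the constant 3.5 with the sample standard deviation, and the data are 1000 exponential samples as in the running example. The name `bin_counts` is hypothetical.

```python
import math
import numpy as np

def bin_counts(x):
    """Suggested number of bins K under four common rules (rounded up)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h_scott = 3.5 * x.std(ddof=1) / n ** (1 / 3)   # Scott's bin width
    return {
        "square-root": math.ceil(math.sqrt(n)),
        "sturges": math.ceil(math.log2(n) + 1),
        "rice": math.ceil(2 * n ** (1 / 3)),
        "scott": math.ceil((x.max() - x.min()) / h_scott),
    }

rng = np.random.default_rng(0)
X = rng.exponential(1.0, size=1000)
print(bin_counts(X))
```

Only Scott's rule depends on the data values; the other three depend on the number of samples alone.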


For the example data shown in Figure 3.10, the histograms obtained using the above rules 
are given in Figure 3.11. As you can see, different rules have different suggested bin widths. 
Some are more conservative, e.g., using fewer bins, whereas some are less conservative. In 
any case, the suggested bin widths do seem to provide better histograms than the original 
ones in Figure 3.10. However, no bin width is the best for all purposes. 


Figure 3.11: Histograms of a dataset using different bin width rules. Panels: Square-root, $K = 32$;
Sturges Rule, $K = 11$; Rice Rule, $K = 20$; Scott Rule, $K = 22$.


Beyond these predefined rules, there are also algorithmic tools to determine the bin
width. One such tool is known as cross-validation. Cross-validation means defining some
kind of cross-validation score that measures the statistical risk associated with the histogram.
A histogram having a lower score has a lower risk, and thus it is a better histogram.
Note that the word “better” is relative to the optimality criteria associated with the
cross-validation score. If you do not agree with our cross-validation score, our optimal bin width is
not necessarily the one you want. In this case, you need to specify your optimality criteria.

Theoretically, deriving a meaningful cross-validation score is beyond the scope of this 
book. However, it is still possible to understand the principle. Let h be the bin width of the 
histogram, K the number of bins, and N the number of samples. Given a dataset, we follow 
this procedure: 


• Step 1: Choose a bin width $h$.

• Step 2: Construct a histogram from the data, using the bin width $h$. The histogram will
have the empirical PMF values $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_K$, which are the heights of the histogram
normalized so that the sum is 1.

• Step 3: Compute the cross-validation score (see Wasserman, All of Statistics, Section 20.2):

$$J(h) = \frac{2}{(N-1)h} - \frac{N+1}{(N-1)h}\left(\hat{p}_1^2 + \hat{p}_2^2 + \cdots + \hat{p}_K^2\right). \qquad (3.4)$$

• Repeat Steps 1, 2, 3, until we find an $h$ that minimizes $J(h)$.


Note that when we use a different $h$, the PMF values $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_K$ will change, and the
number of bins $K$ will also change. Therefore, when changing $h$, we are changing not only
the terms in $J(h)$ that explicitly contain $h$ but also terms that are implicitly influenced.


Figure 3.12: Cross-validation score for the histogram. (a) The score of one particular dataset. (b) The
scores for many different datasets generated by the same model.


For the dataset we showed in Figure 3.10, the cross-validation score $J(h)$ is shown in
Figure 3.12. We can see that although the curve is noisy, there is indeed a reasonably clear
minimum happening around $20 \le K \le 30$, which is consistent with some of the rules.

The MATLAB and Python codes we used to generate Figure 3.12 are shown below.
The key step is to implement Equation (3.4) inside a for-loop, where the loop goes through
the range of bins we are interested in. To obtain the PMF values $\hat{p}_1, \ldots, \hat{p}_K$, we call hist
in MATLAB and np.histogram in Python. In the code, the bin width h is taken to be the
number of samples n divided by the number of bins m; rescaling h by a constant only
rescales $J(h)$, so it does not move the minimizer.


% MATLAB code to perform the cross validation
lambda = 1;
n = 1000;
X = exprnd(1/lambda,[n,1]);
m = 6:200;
J = zeros(1,195);
for i=1:195
    [num,binc] = hist(X,m(i));
    h = n/m(i);
    J(i) = 2/((n-1)*h)-((n+1)/((n-1)*h))*sum( (num/n).^2 );
end
plot(m,J,'LineWidth',4,'Color',[0.9,0.2,0.0]);


# Python code to perform the cross validation
import numpy as np
import matplotlib.pyplot as plt
lambd = 1
n = 1000
X = np.random.exponential(1/lambd, size=n)
m = np.arange(5,200)
J = np.zeros((195))
for i in range(0,195):
    hist,bins = np.histogram(X, bins=m[i])
    h = n/m[i]
    J[i] = 2/((n-1)*h)-((n+1)/((n-1)*h))*np.sum((hist/n)**2)
plt.plot(m,J);


In Figure 3.12(b), we show another set of curves from the same experiment. The
difference here is that we assume access to the true generative model so that we can generate
many datasets from the same distribution. In this experiment we generated $T = 1000$
datasets. We compute the cross-validation score $J(h)$ for each of the datasets, yielding $T$
score functions $J^{(1)}(h), \ldots, J^{(T)}(h)$. We subtract the minimum because different realizations
have different offsets. Then we compute the average:

$$J(h) = \frac{1}{T}\sum_{t=1}^{T}\left\{ J^{(t)}(h) - \min_{h}\left\{J^{(t)}(h)\right\} \right\}. \qquad (3.5)$$

This gives us a smooth red curve as shown in Figure 3.12(b). The minimum appears to be
at $K = 25$. This is the optimal $K$, with respect to the cross-validation score, on the average of
all datasets.
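
The averaging in Equation (3.5) can be sketched as follows. This is a reduced illustration rather than the code used for the figure: we use $T = 50$ datasets instead of 1000 to keep the run time short, we keep the convention $h = n/m$ of the listing above, and the names `cv_score` and `J_avg` are ours.

```python
import numpy as np

def cv_score(x, n_bins):
    """Cross-validation score of Equation (3.4) for a given number of bins."""
    n = x.size
    counts, _ = np.histogram(x, bins=n_bins)
    p_hat = counts / n                  # empirical PMF values
    h = n / n_bins                      # convention used in the listing above
    return 2 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(p_hat ** 2)

rng = np.random.default_rng(0)
n, T = 1000, 50                         # T is reduced for speed (assumption)
m = np.arange(5, 200)                   # candidate numbers of bins

# One row of scores per dataset drawn from the same exponential model.
J = np.array([[cv_score(x, k) for k in m]
              for x in rng.exponential(1.0, size=(T, n))])

# Subtract each dataset's minimum, then average over the T datasets.
J_avg = np.mean(J - J.min(axis=1, keepdims=True), axis=0)
print(m[np.argmin(J_avg)])              # number of bins minimizing the average
```

With fewer datasets the averaged curve is less smooth than the one described in the text, so the location of the minimum may vary somewhat from run to run.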

All rules, including cross-validation, are based on optimizing for a certain objective. 
Your objective could be different from our objective, and so our optimum is not necessarily 
your optimum. Therefore, cross-validation may not be the best. It depends on your problem. 


End of the discussion.


3.3 Cumulative Distribution Functions (Discrete)


While the probability mass function (PMF) provides a complete characterization of a
discrete random variable, the PMFs themselves are technically not “functions” because the
impulses in the histogram are essentially delta functions. More formally, a PMF $p_X(k)$
should actually be written as

$$p_X(x) = \sum_{k \in X(\Omega)} \underbrace{p_X(k)}_{\text{PMF values}} \cdot \underbrace{\delta(x - k)}_{\text{delta function}}.$$

This is a train of delta functions, where the height is specified by the probability mass $p_X(k)$.
For example, a random variable with PMF values

$$p_X(0) = \frac{1}{4}, \quad p_X(1) = \frac{1}{2}, \quad p_X(2) = \frac{1}{4}$$

will be expressed as

$$p_X(x) = \frac{1}{4}\delta(x) + \frac{1}{2}\delta(x - 1) + \frac{1}{4}\delta(x - 2).$$

Since delta functions need to be integrated to generate values, the typical things we want to
do, e.g., integration and differentiation, are not as straightforward in the sense of
Riemann-Stieltjes.

The way to handle the unfriendliness of the delta functions is to consider mild
modifications of the PMF. This notion of “cumulative” distribution functions will allow us to
resolve the delta function problems. We will defer the technical details to the next chapter.
For the time being, we will briefly introduce the idea to prepare you for the technical
discussion later.


3.3.1 Definition of the cumulative distribution function 


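Definition 3.3. Let $X$ be a discrete random variable with $\Omega = \{x_1, x_2, \ldots\}$. The
cumulative distribution function (CDF) of $X$ is

$$F_X(x_k) \overset{\text{def}}{=} \mathbb{P}[X \le x_k] = \sum_{\ell=1}^{k} p_X(x_\ell). \qquad (3.6)$$

If $\Omega = \{\ldots, -1, 0, 1, 2, \ldots\}$, then the CDF of $X$ is

$$F_X(k) \overset{\text{def}}{=} \mathbb{P}[X \le k] = \sum_{\ell=-\infty}^{k} p_X(\ell).$$

A CDF is essentially the cumulative sum of a PMF from $-\infty$ to $x$, where the summation
index is a dummy variable.


Example 3.6. Consider a random variable $X$ with PMF $p_X(0) = \frac{1}{4}$, $p_X(1) = \frac{1}{2}$ and
$p_X(4) = \frac{1}{4}$. The CDF of $X$ can be computed as

$$\begin{aligned}
F_X(0) &= p_X(0) = \frac{1}{4}, \\
F_X(1) &= p_X(0) + p_X(1) = \frac{3}{4}, \\
F_X(4) &= p_X(0) + p_X(1) + p_X(4) = 1.
\end{aligned}$$


Figure 3.13: Illustration of a PMF and a CDF. (a) PMF $p_X(k)$. (b) CDF $F_X(k)$.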


The MATLAB code and the Python code used to generate Figure 3.13 are shown 
below. The CDF is computed using the command cumsum in MATLAB and np.cumsum in 
Python. 


% MATLAB code to generate a PMF and a CDF
p = [0.25 0.5 0.25];
x = [0 1 4];
F = cumsum(p);
figure(1);
stem(x,p,'.','LineWidth',4,'MarkerSize',50);
figure(2);
stairs([-4 x 10],[0 F 1],'.-','LineWidth',4,'MarkerSize',50);


# Python code to generate a PMF and a CDF
import numpy as np
import matplotlib.pyplot as plt
p = np.array([0.25, 0.5, 0.25])
x = np.array([0, 1, 4])
F = np.cumsum(p)
plt.stem(x, p); plt.show()
plt.step(x, F, where='post'); plt.show()   # each step starts at its state




Why is CDF a better-defined function than PMF? There are technical reasons associated
with whether a function is integrable. Without going into the details of these discussions,
a short answer is that delta functions are defined through integrations; they are not
functions. A delta function is described by two conditions: $\delta(x) = 0$ everywhere except at
$x = 0$, and $\int_{-\infty}^{\infty} \delta(x)\,dx = 1$. On the other hand, a staircase function is always well-defined.
The discontinuous points of a staircase can be well defined if we specify the gap between
two consecutive steps. For example, in Figure 3.13, as soon as we specify the gaps 1/4, 1/2,
and 1/4, the staircase function is completely defined.


Example. Figure 3.14 shows the empirical histogram of the English letters and the corre- 
sponding empirical CDF. We want to differentiate PMF versus histogram and CDF versus 
empirical CDF. The empirical CDF is the CDF computed from a finite dataset. 


Figure 3.14: PMF and a CDF of the frequency of English letters.


3.3.2 Properties of the CDF 


We observe from the example in Figure 3.13 that a CDF has several properties. First, being 
a staircase function, the CDF is non-decreasing. It can stay constant for a while, but it never
drops. Second, the minimum value of a CDF is 0, whereas the maximum value is 1. It is 0 
for any value that is smaller than the first state; it is 1 for any value that is larger than the 
last state. Third, the gap at each jump is exactly the probability mass at that state. Let us 
summarize these observations in the following theorem. 


Theorem 3.2. If $X$ is a discrete random variable, then the CDF of $X$ has the following
properties:

(i) The CDF is a sequence of increasing unit steps.

(ii) The maximum of the CDF is when $x = \infty$: $F_X(+\infty) = 1$.

(iii) The minimum of the CDF is when $x = -\infty$: $F_X(-\infty) = 0$.

(iv) The unit steps have jumps at positions where $p_X(x) > 0$.


Proof. Statement (i) can be seen from the summation

$$F_X(x) = \sum_{x' \le x} p_X(x').$$

Since the probability mass function is non-negative, the value of $F_X$ does not decrease when
the value of the argument is larger. That is, $x \le y$ implies $F_X(x) \le F_X(y)$. The second statement (ii)
is true because the summation includes all possible states. So we have

$$F_X(+\infty) = \sum_{x' = -\infty}^{\infty} p_X(x') = 1.$$

Similarly, for the third statement (iii),

$$F_X(-\infty) = \sum_{x' \le -\infty} p_X(x').$$

The summation is taken over an empty set, and so $F_X(-\infty) = 0$. Statement (iv) is true
because the cumulative sum changes only when there is a non-zero mass in the PMF.
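
The properties in Theorem 3.2 can be checked numerically on the PMF of Figure 3.13. The sketch below evaluates the staircase CDF at arbitrary real arguments. It is one possible implementation, relying on `np.searchsorted` with `side="right"` to count the states that do not exceed $x$; the function name `F` is ours.

```python
import numpy as np

states = np.array([0, 1, 4])             # states of X, sorted
pmf = np.array([0.25, 0.5, 0.25])        # probability masses
cdf_at_states = np.cumsum(pmf)           # cumulative sum of the PMF

def F(x):
    """Evaluate F_X(x) = P[X <= x] for a real number x."""
    idx = np.searchsorted(states, x, side="right")   # number of states <= x
    return np.concatenate(([0.0], cdf_at_states))[idx]

print(F(-1), F(0), F(0.999), F(1), F(10))
```

Evaluating at a state returns the value after the jump, because the mass at that state is included by the “≤” in the definition of the CDF, and the size of each jump equals the probability mass at that state.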


As we can see in the proof, the basic argument of the CDF is the cumulative sum of
the PMF. By definition, a cumulative sum always adds mass. This is why the CDF is
non-decreasing, has 0 at $-\infty$, and has 1 at $+\infty$. The last statement of the theorem deserves
more attention. It implies that the unit step always has a solid dot on the left-hand side and
an empty dot on the right-hand side, because when the CDF jumps, the final value is specified
by the “≤” sign in Equation (3.6). The technical term for this property is right continuous.


3.3.3 Converting between PMF and CDF 


Theorem 3.3. If $X$ is a discrete random variable, then the PMF of $X$ can be obtained
from the CDF by

$$p_X(x_k) = F_X(x_k) - F_X(x_{k-1}), \qquad (3.8)$$

where we assumed that $X$ has a countable set of states $\{x_1, x_2, \ldots\}$. If the sample space
of the random variable $X$ contains integers from $-\infty$ to $+\infty$, then the PMF can be
defined as

$$p_X(k) = F_X(k) - F_X(k - 1). \qquad (3.9)$$
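
The conversion in Theorem 3.3 is a first-order difference of the CDF values, which can be sketched with `np.diff`. The CDF values below are those of the PMF 1/4, 1/2, 1/4 on the states 0, 1, 4 used in Figure 3.13, and the value of the CDF before the first state is taken to be 0.

```python
import numpy as np

F = np.array([0.25, 0.75, 1.0])   # CDF evaluated at the states 0, 1, 4

# p_X(x_k) = F_X(x_k) - F_X(x_{k-1}), with the CDF equal to 0 before the first state.
pmf = np.diff(F, prepend=0.0)
print(pmf)

# Going back: the cumulative sum of the PMF recovers the CDF.
print(np.cumsum(pmf))
```

The two operations are inverses of each other, which is why the PMF and the CDF carry the same information about a discrete random variable.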


Example 3.7. Continuing with the example in Figure 3.13, if we are given the CDF

$$F_X(0) = \frac{1}{4}, \quad F_X(1) = \frac{3}{4}, \quad F_X(4) = 1,$$

how do we find the PMF? We know that the PMF will have non-zero values only
at $x = 0, 1, 4$. For each of these $x$, we can show that

$$\begin{aligned}
p_X(0) &= F_X(0) - F_X(-\infty) = \frac{1}{4} - 0 = \frac{1}{4}, \\
p_X(1) &= F_X(1) - F_X(0) = \frac{3}{4} - \frac{1}{4} = \frac{1}{2}, \\
p_X(4) &= F_X(4) - F_X(1) = 1 - \frac{3}{4} = \frac{1}{4}.
\end{aligned}$$


3.4 Expectation


When analyzing data, it is often useful to extract certain key parameters such as the mean
and the standard deviation. The mean and the standard deviation can be viewed through the lens
of random variables. In this section, we will formalize the idea using expectation.


3.4.1 Definition of expectation 


Definition 3.4. The expectation of a random variable $X$ is

$$\mathbb{E}[X] = \sum_{x \in X(\Omega)} x \, p_X(x).$$


Expectation is the mean of the random variable $X$. Intuitively, we can think of $p_X(x)$ as the
percentage of times that the random variable $X$ attains the value $x$. When this percentage
is multiplied by $x$, we obtain the contribution of each $x$. Summing over all possible values
of $x$ then yields the mean. To see this more clearly, we can write the definition as

$$\mathbb{E}[X] = \underbrace{\sum_{x \in X(\Omega)}}_{\text{sum over all states}} \; \underbrace{x}_{\text{a state } X \text{ takes}} \; \underbrace{p_X(x)}_{\text{the percentage}}.$$

Figure 3.15 illustrates a PMF that contains five states $x_1, \ldots, x_5$. Corresponding to each
state are $p_X(x_1), \ldots, p_X(x_5)$. For this PMF to make sense, we must assume that
$p_X(x_1) + \cdots + p_X(x_5) = 1$. To simplify notation, let us define $p_i = p_X(x_i)$. Then the expectation
of $X$ is just the sum of the products: value ($x_i$) times height ($p_i$). This gives
$\mathbb{E}[X] = \sum_{i=1}^{5} x_i \, p_X(x_i)$.


Figure 3.15: The expectation of a random variable is the sum of $x_i p_i$, i.e., $\mathbb{E}[X] = p_1 x_1 + \cdots + p_5 x_5$.

We emphasize that the definition of the expectation is exactly the same as the usual
way we calculate the average of a dataset. When we calculate the average of a dataset
$\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, we sum up these $N$ samples and divide by the number of samples.
This is what we called the empirical average or the sample average:

$$\text{average} = \frac{1}{N} \sum_{n=1}^{N} x^{(n)}. \qquad (3.11)$$




Of course, in a typical dataset, these N samples often take distinct values. But suppose 
that among these N samples there are only K different values. For example, if we throw a 
die a million times, every sample we record will be one of the six numbers. This situation 
is illustrated in Figure 3.16, where we put the samples into the correct bin storing these 
values. In this case, to calculate the average we are effectively doing a binning: 


K 
average = = S- value x, x number of samples with value xx. (3.12) 
k=1 


Equation (3.12) is exactly the same as Equation (3.11), as long as the samples can be grouped 
into K different values. With a little calculation, we can rewrite Equation (3.12) as 


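$$\text{average} = \underbrace{\sum_{k=1}^{K}}_{\text{sum of all states}} \; \underbrace{\text{value } x_k}_{\text{a state } X \text{ takes}} \times \underbrace{\frac{\text{number of samples with value } x_k}{N}}_{\text{the percentage}},$$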


which is the same as the definition of expectation. 
[Figure 3.16 plot: the samples of $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ being sorted into K bins.]

Figure 3.16: If we have a dataset D containing N samples, and if there are only K distinct values, we
can effectively put these N samples into K bins. Thus, the “average” (which is the sum divided by the
number N) is exactly the same as our definition of expectation.


The difference between E[X] and the average is that E[X] is computed from the ideal
histogram, whereas the average is computed from the empirical histogram. When the number of
samples N approaches infinity, we expect the average to approach E[X]. However, when
N is small, the empirical average will have random fluctuations around E[X]. Every time
we experiment, the empirical average may be slightly different. Therefore, we can regard
E[X] as the true average of a certain random variable, and the empirical average as a
finite-sample average based on the particular experiment we are working with. This summarizes
Key Concept 3 of this chapter.


Key Concept 3: What is expectation? 


Expectation = Mean = Average computed from a PMF. 
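
The relationship between the empirical average and the expectation can be checked numerically. The following is a minimal sketch, assuming a fair six-sided die; the seed, the sample sizes, and all variable names are illustrative choices and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed (arbitrary) for reproducibility
states = np.arange(1, 7)                # the K = 6 values a die can take
pmf = np.full(6, 1 / 6)                 # ideal histogram of a fair die (assumed)
EX = np.sum(states * pmf)               # expectation computed from the PMF

for N in [10, 1000, 100000]:
    samples = rng.integers(1, 7, size=N)             # N simulated die throws
    plain_avg = samples.mean()                       # sum of samples divided by N
    counts = np.array([np.sum(samples == s) for s in states])
    binned_avg = np.sum(states * counts / N)         # state times empirical percentage
    print(N, plain_avg, binned_avg, EX)
```

The plain average and the binned average agree for every N, while both fluctuate around E[X] when N is small and settle near it as N grows.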


If we are given a dataset on a computer, computing the mean can be done by calling
the command mean in MATLAB and np.mean in Python. The example below shows the
case of finding the mean of 10000 uniformly distributed random numbers.

% MATLAB code to compute the mean of a dataset
X = rand(10000,1);
mX = mean(X);


# Python code to compute the mean of a dataset
import numpy as np
X = np.random.rand(10000)
mX = np.mean(X)


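Example 3.8. Let X be a random variable with PMF $p_X(0) = 1/4$, $p_X(1) = 1/2$ and
$p_X(2) = 1/4$. We can show that the expectation is

$$\mathbb{E}[X] = (0)\underbrace{\left(\tfrac{1}{4}\right)}_{p_X(0)} + (1)\underbrace{\left(\tfrac{1}{2}\right)}_{p_X(1)} + (2)\underbrace{\left(\tfrac{1}{4}\right)}_{p_X(2)} = 1.$$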


In MATLAB and Python, if we know the PMF then computing the expectation is
straightforward. Here is the code to compute the above example.


% MATLAB code to compute the expectation
p = [0.25 0.5 0.25];
x = [0 1 2];
EX = sum(p.*x);


# Python code to compute the expectation
import numpy as np
p = np.array([0.25, 0.5, 0.25])
x = np.array([0, 1, 2])
EX = np.sum(p*x)


Example 3.9. Flip an unfair coin, where the probability of getting a head is $\frac{3}{4}$. Let
X be a random variable such that X = 1 means getting a head. Then we can show
that $p_X(1) = \frac{3}{4}$ and $p_X(0) = \frac{1}{4}$. The expectation of X is therefore

$$\mathbb{E}[X] = (1)\,p_X(1) + (0)\,p_X(0) = (1)\left(\tfrac{3}{4}\right) + (0)\left(\tfrac{1}{4}\right) = \tfrac{3}{4}.$$


Center of mass. How would you interpret the result of this example? Does it mean
that, on average, we will get 3/4 heads? (There is no such thing as 3/4 heads!) Recall
the definition of a random variable: it is a translator that translates a descriptive state
to a number on the real line. Thus the expectation, which is an operation defined on the
real line, can only tell us what is happening on the real line, not in the original sample
space. On the real line, the expectation can be regarded as the center of mass, which is the
point where the “forces” between the two states are “balanced”. In Figure 3.17 we depict a
random variable with two states $x_1$ and $x_2$. The state $x_1$ has less influence (because $p_X(x_1)$
is smaller) than $x_2$. Therefore the center of mass is shifted towards $x_2$. This result shows us
that the value E[X] is not necessarily one of the states that X can take. E[X] is a deterministic
number that lives on the real line, not in the sample space.

[Figure 3.17 plot: two states $x_1$ and $x_2$ on the real line, with the center of mass E[X] lying closer to $x_2$.]

Figure 3.17: Center of mass. If a state $x_2$ is more influential than another state $x_1$, the center of mass
E[X] will lean towards $x_2$.


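Example 3.10. Let X be a random variable with PMF $p_X(k) = \frac{1}{2^k}$, for k = 1,2,3,....
The expectation is

$$\mathbb{E}[X] = \sum_{k=1}^{\infty} k \, p_X(k) = \sum_{k=1}^{\infty} k \cdot \frac{1}{2^k} = \frac{1}{2} \sum_{k=1}^{\infty} k \cdot \frac{1}{2^{k-1}} = \frac{1}{2} \cdot \frac{1}{\left(1 - \frac{1}{2}\right)^2} = 2.$$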


In MATLAB and Python, if you want to verify this answer you can use the following
code. Here, we approximate the infinite sum by a finite sum of k = 1,...,100.


% MATLAB code to compute the expectation
k = 1:100;
p = 0.5.^k;
EX = sum(p.*k);


# Python code to compute the expectation
import numpy as np
k = np.arange(1, 101)
p = np.power(0.5, k)
EX = np.sum(p*k)


Example 3.11. Roll a die twice. Let X be the first roll and Y be the second roll. 
Let Z = max(X,Y). To compute the expectation E[Z], we first construct the sample 
space. Since there are two rolls, we can construct a table listing all possible pairs of 
outcomes. This will give us {(1,1), (1,2),...,(6,6)}. Now, we calculate Z, which is the 
max of the two rolls. So if we have (1,3), then the max will be 3, whereas if we have 
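(5,2), then the max will be 5. We can complete a table as shown below.

         | Y=1  Y=2  Y=3  Y=4  Y=5  Y=6
    -----+-----------------------------
    X=1  |  1    2    3    4    5    6
    X=2  |  2    2    3    4    5    6
    X=3  |  3    3    3    4    5    6
    X=4  |  4    4    4    4    5    6
    X=5  |  5    5    5    5    5    6
    X=6  |  6    6    6    6    6    6

This table tells us that Z has 6 states. The PMF of Z can be determined by
counting the number of times a state shows up in the table. Thus, we can show that

$$p_Z(1) = \frac{1}{36}, \quad p_Z(2) = \frac{3}{36}, \quad p_Z(3) = \frac{5}{36}, \quad p_Z(4) = \frac{7}{36}, \quad p_Z(5) = \frac{9}{36}, \quad p_Z(6) = \frac{11}{36}.$$

The expectation of Z is therefore

$$\mathbb{E}[Z] = (1)\frac{1}{36} + (2)\frac{3}{36} + (3)\frac{5}{36} + (4)\frac{7}{36} + (5)\frac{9}{36} + (6)\frac{11}{36} = \frac{161}{36}.$$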
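
The counting argument of Example 3.11 can also be carried out by brute-force enumeration. The sketch below assumes two fair dice; the names Z, pZ, and EZ are illustrative and not part of the text.

```python
import numpy as np
from fractions import Fraction

faces = np.arange(1, 7)
Z = np.maximum.outer(faces, faces)      # 6-by-6 table whose (x, y) entry is max(x, y)

# PMF of Z: count how often each state appears among the 36 equally likely pairs
pZ = {int(z): Fraction(int(np.sum(Z == z)), 36) for z in faces}

# Expectation: sum of state times probability, kept as an exact fraction
EZ = sum(z * p for z, p in pZ.items())
print(pZ)
print(EZ, float(EZ))
```

Using exact fractions avoids round-off, so the result can be compared directly with a hand calculation.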


Example 3.12. Consider a game in which we flip a coin 3 times. The reward of the 
game is 


$1 if there are 2 heads 
$8 if there are 3 heads 
$0 if there are 0 or 1 head 


There is a cost associated with the game. To enter the game, the player has to pay 
$1.50. We want to compute the net gain, on average. 


To answer this question, we first note that the sample space contains 8 elements: 
HHH, HHT, HTH, THH, THT, TTH, HTT, TTT. Let X be the number of heads. 
Then the PMF of X is 

$$p_X(0) = \frac{1}{8}, \quad p_X(1) = \frac{3}{8}, \quad p_X(2) = \frac{3}{8}, \quad p_X(3) = \frac{1}{8}.$$

We then let Y be the reward. The PMF of Y can be found by “adding” the probabilities
of X. This yields

$$p_Y(0) = p_X(0) + p_X(1) = \frac{4}{8}, \quad p_Y(1) = p_X(2) = \frac{3}{8}, \quad p_Y(8) = p_X(3) = \frac{1}{8}.$$

The expectation of Y is

$$\mathbb{E}[Y] = (0)\left(\frac{4}{8}\right) + (1)\left(\frac{3}{8}\right) + (8)\left(\frac{1}{8}\right) = \frac{11}{8}.$$

Since the cost of the game is $\frac{12}{8}$, the net gain (on average) is $\frac{11}{8} - \frac{12}{8} = -\frac{1}{8}$.
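
The same game can be evaluated by listing all eight outcomes. This is a minimal sketch assuming a fair coin; the dictionary reward and the other names are introduced here only for illustration.

```python
from itertools import product
from fractions import Fraction

reward = {0: 0, 1: 0, 2: 1, 3: 8}       # dollars paid for each number of heads
cost = Fraction(3, 2)                   # entry fee of 1.50 dollars

outcomes = list(product("HT", repeat=3))            # 8 equally likely sequences
# Each sequence has probability 1/8; its reward depends only on the head count
EY = sum(Fraction(reward[o.count("H")], len(outcomes)) for o in outcomes)
net = EY - cost
print(EY, net)
```

A negative value of net means that, on average, the player loses money by entering the game.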


3.4.2 Existence of expectation 


Does every PMF have an expectation? No, because we can construct a PMF such that the 
expectation is undefined. 


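Example 3.13. Consider a random variable X with the following PMF:

$$p_X(k) = \frac{6}{\pi^2 k^2}, \qquad k = 1, 2, \ldots.$$

Using a result from algebra, one can show that $\sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6}$. Therefore, $p_X(k)$ is a
legitimate PMF because $\sum_{k=1}^{\infty} p_X(k) = 1$. However, the expectation diverges, because

$$\mathbb{E}[X] = \sum_{k=1}^{\infty} k \, p_X(k) = \frac{6}{\pi^2} \sum_{k=1}^{\infty} \frac{1}{k} \to \infty,$$

where the limit is due to the harmonic series*: $1 + \frac{1}{2} + \frac{1}{3} + \cdots = \infty$.


*https://en.wikipedia.org/wiki/Harmonic_series_(mathematics)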


A PMF has an expectation when it is absolutely summable. 


Definition 3.5. A discrete random variable X is absolutely summable if

$$\mathbb{E}[|X|] \overset{\text{def}}{=} \sum_{x \in X(\Omega)} |x| \, p_X(x) < \infty.$$


This definition tells us that not all random variables have a finite expectation. This 
is a very important mathematical result, but its practical implication is arguably limited. 
Most of the random variables we use in practice are absolutely summable. Also, note that 
the property of absolute summability applies to discrete random variables. For continuous 
random variables, we have a parallel concept called absolute integrability, which will be 
discussed in the next chapter. 
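
To see the divergence in Example 3.13 numerically, one can compare partial sums of the total probability with partial sums of the expectation. The sketch below is illustrative only; the helper partial_sums and the truncation points K are assumptions made for this demonstration.

```python
import numpy as np

def partial_sums(K):
    """Return (total probability, partial expectation) using the first K terms."""
    k = np.arange(1, K + 1, dtype=float)
    p = 6.0 / (np.pi**2 * k**2)          # PMF of Example 3.13
    return p.sum(), (k * p).sum()

for K in [10**2, 10**4, 10**6]:
    mass, mean_K = partial_sums(K)
    print(K, mass, mean_K)
```

The total probability settles near 1, whereas the partial expectation keeps growing (roughly like the logarithm of K), which is the numerical signature of a random variable that is not absolutely summable.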


3.4.3. Properties of expectation 


The expectation of a random variable has several useful properties. We list them below. 
Note that these properties apply to both discrete and continuous random variables.

Theorem 3.4. The expectation of a random variable X has the following properties:


(i) Function. For any function g,

$$\mathbb{E}[g(X)] = \sum_{x \in X(\Omega)} g(x) \, p_X(x).$$

(ii) Linearity. For any functions g and h,

$$\mathbb{E}[g(X) + h(X)] = \mathbb{E}[g(X)] + \mathbb{E}[h(X)].$$

(iii) Scale. For any constant c,

$$\mathbb{E}[cX] = c\,\mathbb{E}[X].$$

(iv) DC Shift. For any constant c,

$$\mathbb{E}[X + c] = \mathbb{E}[X] + c.$$

Proof of (i): A pictorial proof of (i) is shown in Figure 3.18. The key idea is a change of 
variable. 


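[Figure 3.18 plot: the PMF $p_X(x)$ along the horizontal axis mapped through g to the PMF $p_Y(y)$ along the vertical axis.]

Figure 3.18: By letting g(X) = Y, the PMFs are not changed. What changes are the states.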


When we have a function Y = g(X), the PMF of Y will have impulses moved from x 
(the horizontal axis) to g(x) (the vertical axis). The PMF values (i.e., the probabilities or 
the height of the stems), however, are not changed. If the mapping g(X) is many-to-one, 
multiple PMF values will add to the same position. Therefore, when we compute E[g(X)], 
we compute the expectation along the vertical axis. 
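
The change-of-variable idea can be illustrated by computing E[g(X)] in two ways. The sketch below uses a hypothetical four-state PMF and the many-to-one mapping g(x) = x²; neither is taken from the text.

```python
import numpy as np
from collections import defaultdict

x = np.array([-1, 0, 1, 2])
px = np.array([0.2, 0.3, 0.1, 0.4])     # hypothetical PMF of X
g = lambda t: t**2                      # many-to-one: -1 and +1 both map to 1

# Route 1: statement (i), sum g(x) p_X(x) over the states of X
E1 = np.sum(g(x) * px)

# Route 2: build the PMF of Y = g(X) first, then apply the definition to Y
py = defaultdict(float)
for xi, pi in zip(x, px):
    py[int(g(xi))] += pi                # probabilities landing on the same y add up
E2 = sum(y * p for y, p in py.items())

print(dict(py), E1, E2)
```

Both routes give the same number, because moving the stems to new positions does not change their heights.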


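Practice Exercise 3.3. Prove statement (iii): For any constant c, $\mathbb{E}[cX] = c\,\mathbb{E}[X]$.


Solution. Recall the definition of expectation:

$$\mathbb{E}[cX] = \sum_{x \in X(\Omega)} (cx) \, p_X(x) = c \sum_{x \in X(\Omega)} x \, p_X(x) = c\,\mathbb{E}[X].$$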


Statement (iii) is illustrated in Figure 3.19. Here, we assume that the original PMF has 3
states X = 0,1,2. We multiply X by a constant c = 3. This changes X to cX = 0,3,6.
However, since the probabilities are not changed, the heights of the PMF values remain the same.
Therefore, when computing the expectation, we just multiply E[X] by c to get cE[X].

[Figure 3.19 plot: the PMF $p_X(x)$ before and after the states are scaled by c, with E[X] marked.]

Figure 3.19: Pictorial representation of $\mathbb{E}[cX] = c\,\mathbb{E}[X]$. When we multiply X by c, we fix the
probabilities but make the spacing between states wider/narrower.


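Practice Exercise 3.4. Prove statement (ii): For any functions g and h,
$\mathbb{E}[g(X) + h(X)] = \mathbb{E}[g(X)] + \mathbb{E}[h(X)]$.


Solution. Recall the definition of expectation:

$$\begin{aligned}
\mathbb{E}[g(X) + h(X)] &= \sum_{x \in X(\Omega)} [g(x) + h(x)] \, p_X(x) \\
&= \underbrace{\sum_{x \in X(\Omega)} g(x) \, p_X(x)}_{=\mathbb{E}[g(X)]} + \underbrace{\sum_{x \in X(\Omega)} h(x) \, p_X(x)}_{=\mathbb{E}[h(X)]} \\
&= \mathbb{E}[g(X)] + \mathbb{E}[h(X)].
\end{aligned}$$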


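Practice Exercise 3.5. Prove statement (iv): For any constant c, $\mathbb{E}[X + c] = \mathbb{E}[X] + c$.


Solution. Recall the definition of expectation:

$$\begin{aligned}
\mathbb{E}[X + c] &= \sum_{x \in X(\Omega)} (x + c) \, p_X(x) \\
&= \underbrace{\sum_{x \in X(\Omega)} x \, p_X(x)}_{=\mathbb{E}[X]} + c \cdot \underbrace{\sum_{x \in X(\Omega)} p_X(x)}_{=1} \\
&= \mathbb{E}[X] + c.
\end{aligned}$$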


This result is illustrated in Figure 3.20. As we add a constant to the random variable, 
its PMF values remain the same but their positions are shifted. Therefore, when computing 
the mean, the mean will be shifted accordingly.

Figure 3.20: Pictorial representation of $\mathbb{E}[X + c] = \mathbb{E}[X] + c$. When we add c to X, we fix the probabilities
and shift the entire PMF to the left or to the right.


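Example 3.14. Let X be a random variable with four equally probable states 0, 1, 2, 3.
We want to compute the expectation $\mathbb{E}[\cos(\pi X/2)]$. To do so, we note that

$$\begin{aligned}
\mathbb{E}[\cos(\pi X/2)] &= \sum_{x \in X(\Omega)} \cos\left(\frac{\pi x}{2}\right) p_X(x) \\
&= (\cos 0)\left(\tfrac{1}{4}\right) + \left(\cos \tfrac{\pi}{2}\right)\left(\tfrac{1}{4}\right) + (\cos \pi)\left(\tfrac{1}{4}\right) + \left(\cos \tfrac{3\pi}{2}\right)\left(\tfrac{1}{4}\right) \\
&= \frac{1 + 0 + (-1) + 0}{4} = 0.
\end{aligned}$$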


Example 3.15. Let X be a random variable with $\mathbb{E}[X] = 1$ and $\mathbb{E}[X^2] = 3$. We want
to find the expectation $\mathbb{E}[(aX + b)^2]$. To do so, we realize that

$$\mathbb{E}[(aX + b)^2] \overset{(a)}{=} \mathbb{E}[a^2 X^2 + 2abX + b^2] \overset{(b)}{=} a^2\,\mathbb{E}[X^2] + 2ab\,\mathbb{E}[X] + b^2 = 3a^2 + 2ab + b^2,$$

where (a) is due to expansion of the square, and (b) holds in two steps. The first step
is to apply statement (ii) to split the expectation of the sum into a sum of expectations, and the
second step is to apply statement (iii) to pull the constants out of each expectation.
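
Both examples can be verified numerically. In the sketch below, the two-state random variable used for Example 3.15 is a hypothetical choice (states 1 ± √2 with equal probability) made only because it satisfies E[X] = 1 and E[X²] = 3; the constants a and b are arbitrary.

```python
import numpy as np

# Example 3.14: four equally probable states 0, 1, 2, 3
x = np.arange(4)
px = np.full(4, 0.25)
E_cos = np.sum(np.cos(np.pi * x / 2) * px)
print(E_cos)                             # zero up to floating-point round-off

# Example 3.15: a hypothetical X with E[X] = 1 and E[X^2] = 3
xs = np.array([1 - np.sqrt(2), 1 + np.sqrt(2)])
ps = np.array([0.5, 0.5])
a, b = 2.0, -1.0                         # arbitrary constants for the check
lhs = np.sum((a * xs + b) ** 2 * ps)     # E[(aX + b)^2] from the definition
rhs = 3 * a**2 + 2 * a * b + b**2        # closed form derived in the example
print(lhs, rhs)
```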


3.4.4 Moments and variance 


Based on the concept of expectation, we can define a moment: 


Definition 3.6. The kth moment of a random variable X is

$$\mathbb{E}[X^k] = \sum_{x} x^k \, p_X(x).$$

Essentially, the kth moment is the expectation applied to $X^k$. The definition follows from
statement (i) of the expectation’s properties. Using this definition, we note that E[X] is the
first moment and $\mathbb{E}[X^2]$ is the second moment. Higher-order moments can be defined, but
in practice they are less commonly used.

Example 3.16. Flip a coin 3 times. Let X be the number of heads. Then

$$p_X(0) = \frac{1}{8}, \quad p_X(1) = \frac{3}{8}, \quad p_X(2) = \frac{3}{8}, \quad p_X(3) = \frac{1}{8}.$$

The second moment $\mathbb{E}[X^2]$ is

$$\mathbb{E}[X^2] = \left(\frac{1}{8}\right)(0)^2 + \left(\frac{3}{8}\right)(1)^2 + \left(\frac{3}{8}\right)(2)^2 + \left(\frac{1}{8}\right)(3)^2 = 3.$$


Using the second moment, we can define the variance of a random variable. 


Definition 3.7. The variance of a random variable X is

$$\mathrm{Var}[X] = \mathbb{E}[(X - \mu)^2],$$

where $\mu = \mathbb{E}[X]$ is the expectation of X.


We denote Var[X] by $\sigma^2$. The square root of the variance, $\sigma$, is called the standard deviation
of X. Like the expectation E[X], the variance Var[X] is computed using the ideal histogram
PMF. It is the limiting object of the usual sample variance (the square of the standard
deviation) we calculate from a dataset.

On a computer, computing the variance of a dataset is done by calling built-in
commands such as var in MATLAB and np.var in Python. The standard deviation is computed
using std and np.std, respectively.


% MATLAB code to compute the variance
X = rand(10000,1);
vX = var(X);
sX = std(X);


# Python code to compute the variance
import numpy as np
X = np.random.rand(10000)
vX = np.var(X)
sX = np.std(X)


What does the variance mean? It is a measure of the deviation of the random variable
X relative to its mean. This deviation is quantified by the squared difference $(X - \mu)^2$. The
expectation operator takes the average of the deviation, giving us a deterministic number
$\mathbb{E}[(X - \mu)^2]$.


Theorem 3.5. The variance of a random variable X has the following properties:


(i) Moment.

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2.$$

(ii) Scale. For any constant c,

$$\mathrm{Var}[cX] = c^2 \, \mathrm{Var}[X].$$

(iii) DC Shift. For any constant c,

$$\mathrm{Var}[X + c] = \mathrm{Var}[X].$$

[Figure 3.21 plot: PMFs of X, cX, and X + c with their spreads marked.]

Figure 3.21: Pictorial representations of $\mathrm{Var}[cX] = c^2\,\mathrm{Var}[X]$ and $\mathrm{Var}[X + c] = \mathrm{Var}[X]$.


Practice Exercise 3.6. Prove Theorem 3.5 above. 


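Solution. For statement (i), we show that

$$\mathrm{Var}[X] = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2 - 2X\mu + \mu^2] = \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 = \mathbb{E}[X^2] - \mu^2.$$

Statement (ii) holds because $\mathbb{E}[cX] = c\mu$ and

$$\begin{aligned}
\mathrm{Var}[cX] &= \mathbb{E}[(cX - \mathbb{E}[cX])^2] \\
&= \mathbb{E}[(cX - c\mu)^2] = c^2 \, \mathbb{E}[(X - \mu)^2] = c^2 \, \mathrm{Var}[X].
\end{aligned}$$

Statement (iii) holds because

$$\mathrm{Var}[X + c] = \mathbb{E}[((X + c) - \mathbb{E}[X + c])^2] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathrm{Var}[X].$$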


The properties above are useful in various ways. The first statement provides a link connect- 
ing variance and the second moment. Statement (ii) implies that when X is scaled by c, the 
variance should be scaled by c? because of the square in the second moment. Statement (iii) 
says that when X is shifted by a scalar c, the variance is unchanged. This is true because 
no matter how we shift the mean, the fluctuation of the random variable remains the same. 
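
The three variance properties can be checked on a small PMF. The sketch below reuses the PMF of the number of heads in three fair coin flips; the helper functions mean and var and the constant c are illustrative names, not part of the text.

```python
import numpy as np

x = np.array([0, 1, 2, 3])
px = np.array([1, 3, 3, 1]) / 8          # PMF of the number of heads in 3 fair flips

def mean(vals, p):
    """Expectation of a discrete random variable with states vals and PMF p."""
    return np.sum(vals * p)

def var(vals, p):
    """Variance from the definition E[(X - mu)^2]."""
    mu = mean(vals, p)
    return np.sum((vals - mu) ** 2 * p)

c = 3.0                                  # arbitrary constant
v = var(x, px)
print(v, mean(x**2, px) - mean(x, px) ** 2)   # statement (i): moment form
print(var(c * x, px), c**2 * v)               # statement (ii): scale
print(var(x + c, px), v)                      # statement (iii): shift
```

Scaling or shifting changes only the positions of the states, so the same PMF array px is reused in every call.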


Practice Exercise 3.7. Flip a coin with probability p to get a head. Let X be a 
random variable denoting the outcome. The PMF of X is 


$$p_X(0) = 1 - p, \qquad p_X(1) = p.$$

Find $\mathbb{E}[X]$, $\mathbb{E}[X^2]$ and Var[X].


Solution. The expectation of X is

$$\mathbb{E}[X] = (0)\,p_X(0) + (1)\,p_X(1) = (0)(1 - p) + (1)(p) = p.$$

The second moment is

$$\mathbb{E}[X^2] = (0)^2 \, p_X(0) + (1)^2 \, p_X(1) = p.$$

The variance is

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = p - p^2 = p(1 - p).$$


3.5 Common Discrete Random Variables 


In the previous sections, we have conveyed three key concepts: one about the random vari- 
able, one about the PMF, and one about the mean. The next step is to introduce a few 
commonly used discrete random variables so that you have something concrete in your “tool- 
box.” As we have mentioned before, these predefined random variables should be studied 
from a synthesis perspective (sometimes called generative). The plan for this section is to 
introduce several models, derive their theoretical properties, and discuss examples. 

Note that some extra effort will be required to understand the origins of the random
variables. The origins of random variables are usually overlooked, but they are more
important than the equations. For example, we will shortly discuss the Poisson random variable
and its PMF $p_X(k) = \frac{\lambda^k e^{-\lambda}}{k!}$. Why is the Poisson random variable defined in this way? If
you know how the Poisson PMF was originally derived, you will understand the assumptions
made during the derivation. Consequently, you will know why Poisson is a good model for
internet traffic, recommendation scores, and image sensors for computer vision applications.
You will also know under what situation the Poisson model will fail. Understanding the
physics behind the probability models is the focus of this section.

[Figure 3.22 plot: two stems, one of height 1 - p at state 0 and one of height p at state 1.]

Figure 3.22: A Bernoulli random variable has two states with probability p and 1 - p.


3.5.1 Bernoulli random variable 


We start by discussing the simplest random variable, namely the Bernoulli random variable.
A Bernoulli random variable is a coin-flip random variable. The random variable has two 
states: either 1 or 0. The probability of getting 1 is p, and the probability of getting 0 is 
1—p. See Figure 3.22 for an illustration. Bernoulli random variables are useful for all kinds 
of binary state events: coin flip (H or T), binary bit (1 or 0), true or false, yes or no, present 
or absent, Democrat or Republican, etc. 

To make these notions more precise, we define a Bernoulli random variable as follows. 


Definition 3.8. Let X be a Bernoulli random variable. Then, the PMF of X is

$$p_X(0) = 1 - p, \qquad p_X(1) = p,$$

where 0 < p < 1 is called the Bernoulli parameter. We write


X ~ Bernoulli(p) 


to say that X is drawn from a Bernoulli distribution with a parameter p. 


In this definition, the parameter p controls the probability of obtaining 1. In a coin-flip event,
p is usually 1/2, meaning that the coin is fair. However, for biased coins p is not necessarily 1/2.
For other situations such as binary bits (0 or 1), the probability of obtaining 1 could be very 
different from the probability of obtaining 0. 

In MATLAB and Python, generating Bernoulli random variables can be done by call- 
ing the binomial random number generator np.random.binomial (Python) and binornd 
(MATLAB). When the parameter n is equal to 1, the binomial random variable is equiv- 
alent to a Bernoulli random variable. The MATLAB and Python codes to synthesize a 
Bernoulli random variable are shown below.

% MATLAB code to generate 1000 Bernoulli random variables
p = 0.5;
n = 1;
X = binornd(n,p,[1000,1]);
[num, ~] = hist(X, 10);
bar(linspace(0,1,10), num, 'FaceColor',[0.4, 0.4, 0.8]);


# Python code to generate 1000 Bernoulli random variables
import numpy as np
import matplotlib.pyplot as plt
p = 0.5
n = 1
X = np.random.binomial(n,p,size=1000)
plt.hist(X,bins='auto')

An alternative method in Python is to call stats.bernoulli.rvs to generate random 
Bernoulli numbers. 


# Python code to call scipy.stats library
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
p = 0.5
X = stats.bernoulli.rvs(p,size=1000)
plt.hist(X,bins='auto');


Properties of Bernoulli random variables 


Let us now derive a few key statistical properties of a Bernoulli random variable. 


Theorem 3.6. If X ~ Bernoulli(p), then

$$\mathbb{E}[X] = p, \qquad \mathbb{E}[X^2] = p, \qquad \mathrm{Var}[X] = p(1 - p).$$

Proof. The expectation can be computed as

$$\mathbb{E}[X] = (1)\,p_X(1) + (0)\,p_X(0) = (1)(p) + (0)(1 - p) = p.$$

The second moment is

$$\mathbb{E}[X^2] = (1^2)(p) + (0^2)(1 - p) = p.$$

Therefore, the variance is

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mu^2 = p - p^2 = p(1 - p).$$


A useful property of the Python code is that we can construct an object rv. Then we 
can call rv’s attributes to determine its mean, variance, etc.

# Python code to generate a Bernoulli rv object
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
p = 0.5
rv = stats.bernoulli(p)
mean, var = rv.stats(moments='mv')
print(mean, var)


In both MATLAB and Python, we can plot the PMF of a Bernoulli random variable, 
such as the one shown in Figure 3.23. To do this in MATLAB, we call the function binopdf, 
with the evaluation points specified by x. 


[Figure 3.23 plot: stem plot of a Bernoulli PMF; the horizontal axis runs from -0.2 to 1.2 and the vertical axis from 0 to 1.]


% MATLAB code to plot the PMF of a Bernoulli
p = 0.3;
x = [0,1];
f = binopdf(x,1,p);
stem(x, f, 'bo', 'LineWidth', 8);


In Python, we construct a random variable rv. With rv, we can call its PMF rv.pmf:


# Python code to plot the PMF of a Bernoulli
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
p = 0.3
rv = stats.bernoulli(p)
x = np.linspace(0, 1, 2)
f = rv.pmf(x)
plt.plot(x, f, 'bo', ms=10);
plt.vlines(x, 0, f, colors='b', lw=5, alpha=0.5);


When will a Bernoulli random variable have the maximum variance?


Let us take a look at the variance of the Bernoulli random variable. For any given p, the
variance is $p(1 - p)$. This is a quadratic function of p. If we let $V(p) = p(1 - p)$, we can show that
the maximum is attained at p = 1/2. To see this, take the derivative of V(p) with respect
to p. This will give us $V'(p) = 1 - 2p$. Equating to zero yields $1 - 2p = 0$, so p = 1/2.
We know that p = 1/2 is a maximum and not a minimum point because the second-order
derivative $V''(p) = -2$, which is negative. Therefore V(p) is maximized at p = 1/2. Now,
since $0 \le p \le 1$, we also know that V(0) = 0 and V(1) = 0. Therefore, the variance is
minimized at p = 0 and p = 1. Figure 3.24 shows a graph of the variance.
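
The calculus argument can be cross-checked with a simple grid search. This is a minimal sketch; the grid of 101 points is an arbitrary choice.

```python
import numpy as np

p = np.linspace(0, 1, 101)               # grid of candidate Bernoulli parameters
V = p * (1 - p)                          # variance p(1 - p) at each grid point
p_star = p[np.argmax(V)]                 # grid point with the largest variance
print(p_star, V.max(), V[0], V[-1])
```

The largest value on the grid occurs at the midpoint, and the two endpoints give zero variance.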


[Figure 3.24 plot: $V(p) = p(1 - p)$ versus p on the interval from 0 to 1, peaking at p = 1/2.]

Figure 3.24: The variance of a Bernoulli reaches maximum at p = 1/2.


Does this result make sense? Why is the variance maximized at p = 1/2? If we think 
about this problem more carefully, we realize that a Bernoulli random variable represents a 
coin-flip experiment. If the coin is biased such that it always gives heads, on the one hand, 
it is certainly a bad coin. However, on the other hand, the variance is zero because there 
is nothing to vary; you will certainly get heads. The same situation happens if the coin is 
biased towards tails. However, if the coin is fair, i.e., p = 1/2, then the variance is large 
because we only have a 50% chance of getting a head or a tail whenever we flip a coin. 
Nothing is certain in this case. Therefore, the maximum variance happening at p = 1/2 
matches our intuition. 


Rademacher random variable 


A slight variation of the Bernoulli random variable is the Rademacher random variable,
which has two states: +1 and -1. The probability of getting each of +1 and -1 is 1/2. Therefore, the
PMF of a Rademacher random variable is

$$p_X(-1) = \frac{1}{2} \quad \text{and} \quad p_X(+1) = \frac{1}{2}.$$


Practice Exercise 3.8. Show that if X is a Rademacher random variable then 
(X + 1)/2 ~ Bernoulli(1/2). Also show the converse: If Y ~ Bernoulli(1/2) then 2Y —1 
is a Rademacher random variable. 


Solution. Since X can either be +1 or —1, we show that if X = +1 then (X+1)/2=1 
and if X = —1 then (X + 1)/2 = 0. The probabilities of getting +1 and —1 are equal. 
Thus, the probabilities of getting (X + 1)/2 = 1 and 0 are also equal. So the resulting 
random variable is Bernoulli(1/2). The other direction can be proved similarly. 
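
The two mappings in this exercise translate directly into code. The sketch below draws Bernoulli(1/2) samples with NumPy, converts them to Rademacher samples, and maps them back; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed
Y = rng.binomial(1, 0.5, size=10000)     # Bernoulli(1/2) samples in {0, 1}
X = 2 * Y - 1                            # Rademacher samples in {-1, +1}
Y_back = (X + 1) // 2                    # map back to {0, 1}

print(np.unique(X), np.unique(Y_back))
print(X.mean(), X.var())                 # empirical mean and variance of X
```

Because the mapping is one-to-one, the probabilities carry over unchanged and only the states move.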




Bernoulli in social networks: the Erdős–Rényi graph


The study of networks is a big branch of modern data science. It includes social networks,
computer networks, traffic networks, etc. The history of network science is very long, but
one of the most basic models of a network is the Erdős–Rényi graph, named after Paul
Erdős and Alfréd Rényi. The underlying probabilistic model of the Erdős–Rényi graph is
the Bernoulli random variable.

To see how a graph can be constructed from a Bernoulli random variable, we first 
introduce the concept of a graph. A graph contains two elements: nodes and edges. For 
node i and node j, we denote the edge connecting i and j as $A_{ij}$. Therefore, if we have N
nodes, then we can construct a matrix A of size $N \times N$. We call this matrix the adjacency
matrix. For example, the adjacency matrix

$$A = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

will have edges for node pairs (1,2), (1,3), and (3,4). Note that in this example we assume
that the adjacency matrix is symmetric, meaning that the graph is undirected. The “1” in
the adjacency matrix indicates there is an edge, and “0” indicates there is no edge. So A
represents a binary graph.

The Erdős–Rényi graph model says that each edge is present or absent according to an
independent Bernoulli random variable. That is,

$$A_{ij} \sim \text{Bernoulli}(p),$$

for i < j. If we model the graph in this way, then the parameter p will control the density
of the graph. High values of p mean that there is a higher chance for an edge to be present.
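
A random adjacency matrix of this kind can be generated in a few lines. The following is a sketch under stated assumptions: the function name erdos_renyi is hypothetical, the graph is taken to be undirected with no self-loops, and the seed is arbitrary.

```python
import numpy as np

def erdos_renyi(N, p, rng):
    """Return a symmetric 0/1 adjacency matrix with A_ij ~ Bernoulli(p) for i < j."""
    U = rng.binomial(1, p, size=(N, N))  # one Bernoulli draw per entry
    upper = np.triu(U, k=1)              # keep only the entries with i < j
    return upper + upper.T               # mirror so that the graph is undirected

rng = np.random.default_rng(0)           # arbitrary seed
N = 40
for p in [0.3, 0.5, 0.7, 0.9]:
    A = erdos_renyi(N, p, rng)
    density = A.sum() / (N * (N - 1))    # fraction of possible edges that are present
    print(p, density)
```

The printed density should be close to p, which is the sense in which p controls how dense the graph is.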


[Figure 3.25 plots: four Erdős–Rényi graphs (top row) and their adjacency matrices (bottom row).]

Figure 3.25: The Erdős–Rényi graph. [Top] The graphs. [Bottom] The adjacency matrices.

To illustrate the idea of an Erdős–Rényi graph, we show in Figure 3.25 graphs of
40 nodes. The edges are randomly selected by drawing a Bernoulli random variable with
parameter p = 0.3, 0.5, 0.7, 0.9. As we can see in the figure, a small value of p gives a graph
with very sparse connectivity, whereas a large value of p gives a very densely connected
graph. The bottom row of Figure 3.25 shows the corresponding adjacency matrices. Here,
a white pixel denotes “1” in the matrix and a black pixel denotes “0” in the matrix.

While Erdős–Rényi graphs are elementary, their variations can be realistic models of
social networks. The stochastic block model is one such model. In a stochastic block model, 
nodes form small communities within a large network. For example, there are many majors 
in a university. Students within the same major tend to have more interactions than with 
students of another major. The stochastic block model achieves this goal by partitioning 
the nodes into communities. Within each community, the nodes can have a high degree of 
connectivity. Across different communities, the connectivity will be much lower. Figure 3.26 
illustrates a network and the corresponding adjacency matrix. In this example, the network 
has three communities. 
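
A stochastic block model can be simulated by letting the Bernoulli parameter depend on whether two nodes share a community. The sketch below is one simple variant under stated assumptions: a single within-community probability p_in and a single across-community probability p_out, both hypothetical values chosen for illustration, as are the function name and community sizes.

```python
import numpy as np

def stochastic_block_model(sizes, p_in, p_out, rng):
    """Symmetric 0/1 adjacency matrix whose edge probability depends on community."""
    labels = np.repeat(np.arange(len(sizes)), sizes)   # community index of each node
    same = labels[:, None] == labels[None, :]          # True if i and j share a community
    P = np.where(same, p_in, p_out)                    # Bernoulli parameter for each pair
    U = rng.random(P.shape) < P                        # independent Bernoulli draws
    upper = np.triu(U, k=1).astype(int)                # keep entries with i < j
    return upper + upper.T, labels

rng = np.random.default_rng(0)                         # arbitrary seed
A, labels = stochastic_block_model([20, 20, 20], p_in=0.8, p_out=0.05, rng=rng)

same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)
print(A[same & off_diag].mean(), A[~same].mean())      # within vs across edge densities
```

When the nodes are ordered by community, as here, the dense blocks sit along the diagonal of the adjacency matrix.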


[Figure 3.26 plots: a network with three communities and its adjacency matrix.]

Figure 3.26: A stochastic block model containing three communities. [Left] The graph. [Right] The
adjacency matrix.


In network analysis, one of the biggest problems is determining the community struc- 
ture and recovering the underlying probabilities. The former task is about grouping the 
nodes into blocks. This is a nontrivial problem because in practice the nodes are never 
arranged nicely, as shown in Figure 3.26. For example, why should Alice be node 1 and 
Bob be node 2? Since we never know the correct ordering of the nodes, partitioning the 
nodes into blocks requires various estimation techniques such as clustering or iterative esti- 
mation. Recovering the underlying probability is also not easy. Given an adjacency matrix, 
why can we assume that the underlying network is a stochastic block model? Even if the 
model is correct, there will be imperfect grouping in the previous step. As such, estimat- 
ing the underlying probability in the presence of these uncertainties would pose additional 
challenges. 

Today, network analysis remains one of the hottest areas in data science. Its importance 
derives from its broad scope and impact. It can be used to analyze social networks, opinion 
polls, marketing, or even genome analysis. Nevertheless, the starting point of these advanced 
subjects is the Bernoulli random variable, the random variable of a coin flip!


3.5.2 Binomial random variable


Suppose we flip the coin n times and count the number of heads. Since each coin flip is a random
variable (Bernoulli), the sum is also a random variable. It turns out that this new random
variable is the binomial random variable.


Definition 3.9. Let X be a binomial random variable. Then, the PMF of X is

$$p_X(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad k = 0, 1, \ldots, n,$$

where 0 < p < 1 is the binomial parameter, and n is the total number of trials. We
write

$$X \sim \text{Binomial}(n, p)$$

to say that X is drawn from a binomial distribution with a parameter p of size n.


To understand the meaning of a binomial random variable, consider a simple experiment 
consisting of flipping a coin three times. We know that all possible cases are HHH, HHT, 
HTH, THH, TTH, THT, HTT and TTT. Now, suppose we define X = number of heads. 
We want to write down the probability mass function. Effectively, we ask: What is the 
probability of getting zero heads, one head, two heads, and three heads? We can, of course,
count and get the answer right away for a fair coin. However, suppose the coin is unfair, 
i.e., the probability of getting a head is p whereas that of a tail is 1 — p. The probability of 
getting each of the 8 cases is shown in Figure 3.27 below. 


[Figure 3.27 plot: the 8 head/tail sequences grouped by number of heads, with per-sequence probabilities $p^3$, $p^2(1 - p)$, $p(1 - p)^2$, and $(1 - p)^3$.]

Figure 3.27: The probability of getting k heads out of n = 3 coins.


Here are the detailed calculations. Let us start with X = 3.

$$\begin{aligned}
p_X(3) &= \mathbb{P}[\{HHH\}] \\
&= \mathbb{P}[\{H\} \cap \{H\} \cap \{H\}] \\
&\overset{(a)}{=} \mathbb{P}[\{H\}]\,\mathbb{P}[\{H\}]\,\mathbb{P}[\{H\}] \\
&\overset{(b)}{=} p^3,
\end{aligned}$$

where (a) holds because the three events are independent. (Recall that if A and B are
independent then $\mathbb{P}[A \cap B] = \mathbb{P}[A]\,\mathbb{P}[B]$.) (b) holds because each $\mathbb{P}[\{H\}] = p$ by definition.
With exactly the same argument, we can show that $p_X(0) = \mathbb{P}[\{TTT\}] = (1 - p)^3$.

Now, let us look at $p_X(2)$, i.e., 2 heads. This probability can be calculated as follows:

$$\begin{aligned}
p_X(2) &= \mathbb{P}[\{HHT\} \cup \{HTH\} \cup \{THH\}] \\
&\overset{(c)}{=} \mathbb{P}[\{HHT\}] + \mathbb{P}[\{HTH\}] + \mathbb{P}[\{THH\}] \\
&\overset{(d)}{=} p^2(1 - p) + p^2(1 - p) + p^2(1 - p) = 3p^2(1 - p),
\end{aligned}$$

where (c) holds because the three events HHT, HTH and THH are disjoint in the sample
space. Note that we are not using the independence argument in (c) but the disjoint
argument. We should not confuse the two. The step in (d) uses independence, because each coin
flip is independent.

The above calculation shows an interesting phenomenon: Although the three events
HHT, HTH, and THH are different (in fact, disjoint), the number of heads in all the cases
is the same. This happens because when counting the number of heads, the ordering of the
heads and tails does not matter. So the same problem can be formulated as finding the
number of combinations of { 2 heads and 1 tail }, which in our case is $\binom{3}{2} = 3$.

To complete the story, let us also try $p_X(1)$. This probability is

$$p_X(1) = \mathbb{P}[\{TTH\} \cup \{HTT\} \cup \{THT\}] = 3p(1 - p)^2.$$

Again, we see that the combination $\binom{3}{1} = 3$ appears in front of the $p(1 - p)^2$.


In general, the way to interpret the binomial random variable is to decouple the
probabilities p, (1 - p), and the number of combinations $\binom{n}{k}$:

$$p_X(k) = \underbrace{\binom{n}{k}}_{\text{number of combinations}} \; \underbrace{p^k}_{\text{prob getting } k \text{ H's}} \; \underbrace{(1 - p)^{n-k}}_{\text{prob getting } n-k \text{ T's}}.$$

The running index k takes the values 0,1,...,n. It starts with 0 because there could be zero
heads in the sample space. Furthermore, we note that in this definition, two parameters are
driving a binomial random variable: the number of Bernoulli trials n and the underlying
probability for each coin flip p. As such, the notation for a binomial random variable is
Binomial(n, p), with two arguments.
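
The decoupling into “number of combinations” times “probability of each sequence” can be confirmed by enumerating every head/tail sequence. The sketch below uses n = 3 and an arbitrary unfair coin with p = 0.3; the function name binomial_pmf is illustrative.

```python
from itertools import product
from math import comb

def binomial_pmf(k, n, p):
    """Number of combinations times prob of k heads times prob of n - k tails."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 3, 0.3                            # p = 0.3 is an arbitrary unfair coin
brute = [0.0] * (n + 1)
for seq in product("HT", repeat=n):      # all 2^n head/tail sequences
    k = seq.count("H")
    # Independence: the probability of a sequence is the product over its flips
    brute[k] += p**k * (1 - p) ** (n - k)

formula = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(brute)
print(formula)
```

Summing over disjoint sequences with the same number of heads reproduces the closed-form PMF term by term.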

The histogram of a binomial random variable is shown in Figure 3.28(a). Here, we con- 
sider the example where n = 10 and p = 0.5. To generate the histogram, we use 5000 samples. 
In MATLAB and Python, generating binomial random variables as in Figure 3.28(a) can
be done by calling binornd and np.random.binomial.


% MATLAB code to generate 5000 Binomial random variables
p = 0.5;
n = 10;
X = binornd(n,p,[5000,1]);
[num, ~] = hist(X, 10);
bar(num, 'FaceColor',[0.4, 0.4, 0.8]);


# Python code to generate 5000 Binomial random variables
import numpy as np
import matplotlib.pyplot as plt
p = 0.5
n = 10
X = np.random.binomial(n,p,size=5000)
plt.hist(X,bins='auto');


[Figure 3.28 plots: (a) histogram based on 5000 samples; (b) PMF.]

Figure 3.28: An example of a binomial distribution with n = 10, p = 0.5.


Generating the ideal PMF of a binomial random variable as shown in Figure 3.28(b) 
can be done by calling binopdf in MATLAB. In Python, we can define a random variable 
rv through stats.binom, and call the PMF using rv. pmf. 


% MATLAB code to generate a binomial PMF
p = 0.5;
n = 10;
x = 0:10;
f = binopdf(x,n,p);
stem(x, f, 'o', 'LineWidth', 8, 'Color', [0.8, 0.4, 0.4]);


# Python code to generate a binomial PMF
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
p = 0.5
n = 10
rv = stats.binom(n,p)
x = np.arange(11)
f = rv.pmf(x)
plt.plot(x, f, 'bo', ms=10);
plt.vlines(x, 0, f, colors='b', lw=5, alpha=0.5);


The shape of the binomial PMF is shown in Figure 3.29. In this set of figures, we vary 
one of the two parameters n and p while keeping the other fixed. In Figure 3.29(a), we fix 
n = 60 and plot three sets of p = 0.1,0.5,0.9. For small p the PMF is skewed towards the 
left, and for large p the PMF is skewed toward the right. Figure 3.29(b) shows the PMF
for a fixed p = 0.5. As we increase n, the centroid of the PMF moves towards the right.
Thus we should expect the mean of a binomial random variable to increase with both n and p. Another
interesting observation is that as n increases, the shape of the PMF approaches the Gaussian 
function (the bell-shaped curve). We will explain the reason for this when we discuss the 
Central Limit Theorem. 


Figure 3.29: PMFs of a binomial random variable X ~ Binomial(n, p). (a) We assume that n = 60. By
varying the probability p, we see that the PMF shifts from the left to the right, and the shape changes.
(b) We assume that p = 0.5. By varying the number of trials, the PMF shifts and the shape becomes
more "bell-shaped."


The expectation, second moment, and variance of a binomial random variable are 
summarized in Theorem 3.7. 


Theorem 3.7. If $X \sim \text{Binomial}(n, p)$, then

$$\mathbb{E}[X] = np,$$
$$\mathbb{E}[X^2] = np(np + (1-p)),$$
$$\mathrm{Var}[X] = np(1-p).$$


We will prove that $\mathbb{E}[X] = np$ from first principles. For $\mathbb{E}[X^2]$ and $\mathrm{Var}[X]$, we will skip
the proofs here and will introduce a "shortcut" later.


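Proof. Let us start with the definition.

$$\mathbb{E}[X] = \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k} = \sum_{k=1}^{n} k \cdot \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} = \sum_{k=1}^{n} \frac{n!}{(k-1)!\,(n-k)!}\, p^k (1-p)^{n-k}.$$

Note that we have shifted the index from k = 0 to k = 1, because the k = 0 term is zero. Now let us apply a trick:
factor out $np$ and substitute $\ell = k - 1$.

$$\mathbb{E}[X] = np \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!}\, p^{k-1} (1-p)^{n-k}$$
$$= np \sum_{\ell=0}^{n-1} \frac{(n-1)!}{\ell!\,(n-1-\ell)!}\, p^{\ell} (1-p)^{n-1-\ell}$$
$$= np \underbrace{\sum_{\ell=0}^{n-1} \binom{n-1}{\ell} p^{\ell} (1-p)^{n-1-\ell}}_{\text{summing PMF of Binomial}(n-1,\,p)} = np.$$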
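
The three formulas in Theorem 3.7 can also be checked numerically by summing over the PMF. The sketch below is an illustration under our own assumptions: the function name binomial_moments is hypothetical, and n = 10, p = 0.5 are borrowed from the histogram example.

```python
from math import comb

def binomial_moments(n, p):
    """Return (E[X], E[X^2], Var[X]) by direct summation over the PMF."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    second = sum(k**2 * pk for k, pk in enumerate(pmf))
    return mean, second, second - mean**2

n, p = 10, 0.5
mean, second, var = binomial_moments(n, p)

# Compare each summed quantity with the closed form in Theorem 3.7.
print("E[X]   :", mean, "vs np             =", n * p)
print("E[X^2] :", second, "vs np(np+(1-p)) =", n * p * (n * p + (1 - p)))
print("Var[X] :", var, "vs np(1-p)        =", n * p * (1 - p))
```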


In MATLAB, the mean and variance of a binomial random variable can be found by
calling the command binostat(n,p).
In Python, the command is rv = stats.binom(n,p) followed by calling rv.stats.


% MATLAB code to compute the mean and var of a binomial rv
p = 0.5;
n = 10;
[M,V] = binostat(n, p)


# Python code to compute the mean and var of a binomial rv
import scipy.stats as stats
p = 0.5
n = 10
rv = stats.binom(n,p)
M, V = rv.stats(moments='mv')
print(M, V)




An alternative view of the binomial random variable. As we discussed, the origin of a 
binomial random variable is the sum of a sequence of Bernoulli random variables. Because 
of this intrinsic definition, we can derive some useful results by exploiting this fact. To do so, 
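let us define $I_1, \ldots, I_n$ as a sequence of independent Bernoulli random variables with $I_i \sim \text{Bernoulli}(p)$
for all $i = 1, \ldots, n$. Then the resulting variable

$$X = I_1 + I_2 + \cdots + I_n$$

is a binomial random variable of size n and parameter p. Using this definition, we can
compute the expectation as follows:

$$\mathbb{E}[X] = \mathbb{E}[I_1 + I_2 + \cdots + I_n] \overset{(a)}{=} \mathbb{E}[I_1] + \mathbb{E}[I_2] + \cdots + \mathbb{E}[I_n] = p + p + \cdots + p = np.$$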


In this derivation, the step (a) depends on a useful fact about expectation (which we have not
yet proved): For any two random variables X and Y, it holds that $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$.
Therefore, we can show that the expectation of X is np. This line of argument not only
simplifies the proof but also provides a good intuition of the expectation. If each coin flip
has an expectation of $\mathbb{E}[I_i] = p$, then the expectation of the sum should be simply n times
p, giving np.


How about the variance? Again, we are going to use a very useful fact about variance:
If two random variables X and Y are independent, then $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.
With this result, we can show that

$$\mathrm{Var}[X] = \mathrm{Var}[I_1 + \cdots + I_n]$$
$$= \mathrm{Var}[I_1] + \cdots + \mathrm{Var}[I_n]$$
$$= p(1-p) + \cdots + p(1-p)$$
$$= np(1-p).$$


Finally, using the fact that $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mu^2$ with $\mu = \mathbb{E}[X] = np$, we can show that

$$\mathbb{E}[X^2] = \mathrm{Var}[X] + \mu^2 = np(1-p) + (np)^2.$$
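
The sum-of-Bernoulli view also gives a direct way to simulate a binomial random variable. The following sketch is a hedged illustration: the seed, the number of trials, and the helper name binomial_sample are our own choices, and the sample statistics only approximate np and np(1-p).

```python
import random

random.seed(0)                    # fixed seed so the run is repeatable
n, p, trials = 10, 0.5, 20000     # illustrative values (assumed)

def binomial_sample(n, p):
    """Draw X = I_1 + ... + I_n, where each I_i is an independent Bernoulli(p)."""
    return sum(1 if random.random() < p else 0 for _ in range(n))

samples = [binomial_sample(n, p) for _ in range(trials)]
sample_mean = sum(samples) / trials
sample_var = sum((x - sample_mean) ** 2 for x in samples) / trials

print("sample mean:", sample_mean, " theory np      =", n * p)
print("sample var :", sample_var, " theory np(1-p) =", n * p * (1 - p))
```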


Practice Exercise 3.9. Show that the binomial PMF sums to 1.
Solution. We use the binomial theorem to prove this result:

$$\sum_{k=0}^{n} p_X(k) = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = (p + (1-p))^n = 1.$$


The CDF of the binomial random variable is not very informative. It is basically the
cumulative sum of the PMF:

$$F_X(k) = \sum_{\ell=0}^{k} \binom{n}{\ell} p^{\ell} (1-p)^{n-\ell}.$$

Figure 3.30: PMF and CDF of a binomial random variable X ~ Binomial(n, p).

The shapes of the PMF and the CDF are shown in Figure 3.30.

In MATLAB, plotting the CDF of a binomial can be done by calling the function 
binocdf. You may also call f = binopdf(x,n,p), and define F = cumsum(f) as the cumu- 
lative sum of the PMF. In Python, the corresponding command is rv = stats.binom(n,p) 
followed by rv.cdf. 


% MATLAB code to plot the CDF of a binomial rv
x = 0:10;
p = 0.5;
n = 10;
F = binocdf(x,n,p);
figure; stairs(x,F,'.-','LineWidth',4,'MarkerSize',30);


# Python code to plot the CDF of a binomial rv
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
p = 0.5
n = 10
rv = stats.binom(n,p)
x = np.arange(11)
F = rv.cdf(x)
plt.plot(x, F, 'bo', ms=10);
plt.vlines(x, 0, F, colors='b', lw=5, alpha=0.5);
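
The cumulative-sum view of the CDF can also be reproduced without any statistics library. The sketch below is a minimal illustration using only the Python standard library; the variable names are our own, and n = 10, p = 0.5 match the listings above.

```python
from itertools import accumulate
from math import comb

n, p = 10, 0.5
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

# The CDF at k is the running sum of the PMF over 0, 1, ..., k.
cdf = list(accumulate(pmf))

for k, (pk, Fk) in enumerate(zip(pmf, cdf)):
    print(k, round(pk, 6), round(Fk, 6))
```

The last entry of cdf should equal 1 up to floating-point rounding, since it is the sum of the whole PMF.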


3.5.3 Geometric random variable


In some applications, we are interested in trying a binary experiment until we succeed. For 
example, we may want to keep calling someone until the person picks up the call. In this 
case, the random variable can be defined as the outcome of many failures followed by a final 
success. This is called the geometric random variable. 


Definition 3.10. Let X be a geometric random variable. Then, the PMF of X is

$$p_X(k) = (1-p)^{k-1} p, \qquad k = 1, 2, \ldots,$$

where $0 < p < 1$ is the geometric parameter. We write

$$X \sim \text{Geometric}(p)$$

to say that X is drawn from a geometric distribution with a parameter p.


A geometric random variable is easy to understand. We define it as Bernoulli trials with
k - 1 consecutive failures followed by one success. This can be seen from the definition:

$$p_X(k) = \underbrace{(1-p)^{k-1}}_{k-1 \text{ failures}} \; \underbrace{p}_{\text{final success}}.$$

Note that in geometric random variables, there is no $\binom{n}{k}$ because we must have k - 1
consecutive failures before one success. There is no alternative combination of the sequence.
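
The "failures followed by one success" description translates directly into a simulation. The following sketch is an illustration under our own assumptions (seed, number of trials, and the helper name geometric_sample are hypothetical); it compares empirical frequencies with $(1-p)^{k-1}p$.

```python
import random

random.seed(1)            # fixed seed so the run is repeatable
p, trials = 0.5, 20000    # illustrative values (assumed)

def geometric_sample(p):
    """Repeat Bernoulli(p) trials and return the index of the first success."""
    k = 1
    while random.random() >= p:   # a failure occurs with probability 1 - p
        k += 1
    return k

samples = [geometric_sample(p) for _ in range(trials)]

# Empirical frequency of each outcome versus the geometric PMF.
freq = {k: samples.count(k) / trials for k in range(1, 6)}
for k in range(1, 6):
    print(k, round(freq[k], 4), (1 - p) ** (k - 1) * p)
```

With this convention the smallest possible value is k = 1, which matches Definition 3.10.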

The histogram and PMF of a geometric random variable are illustrated in Figure 3.31.
Here, we assume that p = 0.5.

Figure 3.31: An example of a geometric distribution with p = 0.5. (a) Histogram based on 5000 samples. (b) PMF.


In MATLAB, generating geometric random variables can be done by calling the command
geornd. In Python, it is np.random.geometric.


% MATLAB code to generate 5000 geometric random variables
p = 0.5;
% Note: geornd counts the failures before the first success, so its values start at 0.
X = geornd(p,[5000,1]);
[num, ~] = hist(X, 0:10);
bar(0:10, num, 'FaceColor',[0.4, 0.4, 0.8]);


# Python code to generate 1000 geometric random variables
import numpy as np
import matplotlib.pyplot as plt
p = 0.5
X = np.random.geometric(p,size=1000)
plt.hist(X,bins='auto');


To generate the PMF plots, in MATLAB we call geopdf and in Python we call
rv = stats.geom followed by rv.pmf.

% MATLAB code to generate geometric PMF
p = 0.5; x = 0:10;
f = geopdf(x,p);
stem(x, f, 'o', 'LineWidth', 8, 'Color', [0.8, 0.4, 0.4]);


# Python code to generate geometric PMF
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
p = 0.5
x = np.arange(1,11)
rv = stats.geom(p)
f = rv.pmf(x)
plt.plot(x, f, 'bo', ms=8, label='geom pmf')
plt.vlines(x, 0, f, colors='b', lw=5, alpha=0.5)


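Practice Exercise 3.10. Show that the geometric PMF sums to one.

Solution. We can apply infinite series to show the result:

$$\sum_{k=1}^{\infty} p_X(k) = \sum_{k=1}^{\infty} (1-p)^{k-1} p = p \sum_{\ell=0}^{\infty} (1-p)^{\ell} = p \cdot \frac{1}{1-(1-p)} = 1.$$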


It is interesting to compare the shapes of the PMFs for various values of p, as shown in
Figure 3.32. We vary the parameter p = 0.25, 0.5, 0.9. For small p, the PMF starts
with a low value and decays at a slow speed. The opposite happens for a large p, where the 
PMF starts with a high value and decays rapidly. 

Furthermore, we can derive the following properties of the geometric random variable. 


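Theorem 3.8. If $X \sim \text{Geometric}(p)$, then

$$\mathbb{E}[X] = \frac{1}{p},$$
$$\mathbb{E}[X^2] = \frac{2}{p^2} - \frac{1}{p},$$
$$\mathrm{Var}[X] = \frac{1-p}{p^2}.$$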

Proof. We will prove that the mean is 1/p and leave the second moment and variance as
an exercise.

$$\mathbb{E}[X] = \sum_{k=1}^{\infty} k\, p (1-p)^{k-1} = p \sum_{k=1}^{\infty} k (1-p)^{k-1} \overset{(a)}{=} p \cdot \frac{1}{(1-(1-p))^2} = \frac{1}{p},$$

where (a) follows from the infinite series identity in Chapter 1, namely $\sum_{k=1}^{\infty} k r^{k-1} = \frac{1}{(1-r)^2}$ for $|r| < 1$.

Figure 3.32: PMFs of a geometric random variable X ~ Geometric(p).
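
The moments of a geometric random variable can be checked numerically by truncating the infinite sums. This is a rough sketch under our own assumptions: the function name geometric_moments, the truncation point kmax = 2000, and the value p = 0.25 are illustrative, and the truncated tail is negligible only because $(1-p)^k$ decays quickly.

```python
def geometric_moments(p, kmax=2000):
    """Approximate (E[X], E[X^2], Var[X]) by summing the PMF for k = 1, ..., kmax."""
    ks = range(1, kmax + 1)
    pmf = [(1 - p) ** (k - 1) * p for k in ks]
    mean = sum(k * pk for k, pk in zip(ks, pmf))
    second = sum(k**2 * pk for k, pk in zip(ks, pmf))
    return mean, second, second - mean**2

p = 0.25
mean, second, var = geometric_moments(p)

print("E[X]   :", mean, "vs 1/p         =", 1 / p)
print("E[X^2] :", second, "vs 2/p^2 - 1/p =", 2 / p**2 - 1 / p)
print("Var[X] :", var, "vs (1-p)/p^2   =", (1 - p) / p**2)
```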


3.5.4 Poisson random variable 


In many physical systems, the arrivals of events are typically modeled as a Poisson ran- 
dom variable, e.g., photon arrivals, electron emissions, and telephone call arrivals. In social 
networks, the number of conversations per user can also be modeled as a Poisson. In e- 
commerce, the number of transactions per paying user is again modeled using a Poisson. 


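Definition 3.11. Let X be a Poisson random variable. Then, the PMF of X is

$$p_X(k) = \frac{\lambda^k}{k!} e^{-\lambda}, \qquad k = 0, 1, 2, \ldots,$$

where $\lambda > 0$ is the Poisson rate. We write

$$X \sim \text{Poisson}(\lambda)$$

to say that X is drawn from a Poisson distribution with a parameter $\lambda$.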


In this definition, the parameter $\lambda$ determines the rate of the arrival. The histogram and
PMF of a Poisson random variable are illustrated in Figure 3.33. Here, we assume that
$\lambda = 1$.

The MATLAB code and Python code used to generate the histogram are shown below. 


% MATLAB code to generate 5000 Poisson numbers
lambda = 1;
X = poissrnd(lambda,[5000,1]);
[num, ~] = hist(X, 0:10);
bar(0:10, num, 'FaceColor',[0.4, 0.4, 0.8]);

Figure 3.33: An example of a Poisson distribution with $\lambda = 1$.


# Python code to generate 5000 Poisson random variables
import numpy as np
import matplotlib.pyplot as plt
lambd = 1
X = np.random.poisson(lambd,size=5000)
plt.hist(X,bins='auto');


For the PMF, in MATLAB we can call poisspdf, and in Python we can call rv.pmf
with rv = stats.poisson.


% MATLAB code to plot the Poisson PMF
lambda = 1;
x = 0:10;
f = poisspdf(x,lambda);
stem(x, f, 'o', 'LineWidth', 8, 'Color', [0.8, 0.4, 0.4]);


# Python code to plot the Poisson PMF
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
lambd = 1
x = np.arange(0,11)
rv = stats.poisson(lambd)
f = rv.pmf(x)
plt.plot(x, f, 'bo', ms=8, label='poisson pmf')
plt.vlines(x, 0, f, colors='b', lw=5, alpha=0.5)


The shape of the Poisson PMF changes with $\lambda$. As illustrated in Figure 3.34, $p_X(k)$ is
more concentrated at lower values for smaller $\lambda$ and becomes spread out for larger $\lambda$. Thus,
we should expect that the mean and variance of a Poisson random variable will change
together as a function of $\lambda$. In the same figure, we show the CDF of a Poisson random
variable. The CDF of a Poisson is

$$F_X(k) = P[X \le k] = \sum_{\ell=0}^{k} \frac{\lambda^{\ell}}{\ell!} e^{-\lambda}. \qquad (3.17)$$

Figure 3.34: A Poisson random variable using different $\lambda$'s. [Left] Probability mass function $p_X(k)$.
[Right] Cumulative distribution function $F_X(k)$.


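Example 3.18. Let X be a Poisson random variable with parameter $\lambda$. Find $P[X > 4]$
and $P[X < 5]$.


Solution. Since X takes only non-negative integer values, $P[X < 5] = P[X \le 4]$. Using the CDF,

$$P[X > 4] = 1 - P[X \le 4] = 1 - \sum_{k=0}^{4} \frac{\lambda^k}{k!} e^{-\lambda},$$
$$P[X < 5] = P[X \le 4] = \sum_{k=0}^{4} \frac{\lambda^k}{k!} e^{-\lambda}.$$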
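
The two probabilities in Example 3.18 are easy to evaluate once a value of $\lambda$ is fixed. The sketch below is illustrative only: the example leaves $\lambda$ general, so the value lam = 2.0 and the helper names poisson_pmf and poisson_cdf are our own assumptions.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """p_X(k) = lam^k / k! * exp(-lam)."""
    return lam**k / factorial(k) * exp(-lam)

def poisson_cdf(k, lam):
    """F_X(k) = P[X <= k], the cumulative sum of the PMF up to k."""
    return sum(poisson_pmf(j, lam) for j in range(k + 1))

lam = 2.0                          # assumed value; the example keeps lambda symbolic
p_gt_4 = 1 - poisson_cdf(4, lam)   # P[X > 4] = 1 - P[X <= 4]
p_lt_5 = poisson_cdf(4, lam)       # P[X < 5] = P[X <= 4] for integer-valued X

print("P[X > 4] =", p_gt_4)
print("P[X < 5] =", p_lt_5)
```

Because X is integer-valued, the two events are complements of each other, so the two numbers add up to one.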


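Practice Exercise 3.11. Show that the Poisson PMF sums to 1.


Solution. We use the exponential series to prove this result:

$$\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} e^{-\lambda} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1.$$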


Poisson random variables in practice 


(1) Computational photography. In computational photography, the Poisson random
variable is one of the most widely used models for photon arrivals. The reason pertains to the
origin of the Poisson random variable, which we will discuss shortly. When photons are
emitted from the source, they travel through the medium as a sequence of independent events.
During the integration period of the camera, the photons are accumulated to generate a
voltage that is then translated to digital bits.


Figure 3.35: The Poisson random variable can be used to model photon arrivals. [Diagram: photons arrive at rate $\alpha$ over an integration period t; X = number of photons, with mean $\alpha t$.]


If we assume that the photon arrival rate is $\alpha$ (photons per second), and suppose that
the total amount of integration time is t, then the average number of photons that the sensor
can see is $\alpha t$. Let X be the number of photons seen during the integration time. Then if we
follow the Poisson model, we can write down the PMF of X:

$$P[X = k] = \frac{(\alpha t)^k}{k!} e^{-\alpha t}.$$

Therefore, if a pixel is bright, meaning that $\alpha$ is large, then X will have a higher likelihood
of landing on a large number.


(2) Traffic model. The Poisson random variable can be used in many other problems. For 
example, we can use it to model the number of passengers on a bus or the number of spam 
phone calls. The required modification to Figure 3.35 is almost trivial: merely replace the 
photons with your favorite cartoons, e.g., a person or a phone, as shown in Figure 3.36. In 
the United States, shared-ride services such as Uber and Lyft need to model the vacant cars 
and the passengers. As long as they have an arrival rate and certain degrees of independence 
between events, the Poisson random variable will be a good model. 

As you can see from these examples, the Poisson random variable has broad applica- 
bility. Before we continue our discussion of its applications, let us introduce a few concepts 
related to the Poisson random variable. 


Properties of a Poisson random variable 


We now derive the mean and variance of a Poisson random variable. 


Theorem 3.9. If $X \sim \text{Poisson}(\lambda)$, then

$$\mathbb{E}[X] = \lambda,$$
$$\mathbb{E}[X^2] = \lambda + \lambda^2,$$
$$\mathrm{Var}[X] = \lambda.$$

Figure 3.36: The Poisson random variable can be used to model passenger arrivals and the number of
phone calls, and can be used by Uber or Lyft to provide shared rides.


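Proof. Let us first prove the mean. It can be shown that

$$\mathbb{E}[X] = \sum_{k=0}^{\infty} k \cdot \frac{\lambda^k}{k!} e^{-\lambda} = \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} e^{-\lambda} = \lambda e^{-\lambda} \sum_{\ell=0}^{\infty} \frac{\lambda^{\ell}}{\ell!} = \lambda e^{-\lambda} e^{\lambda} = \lambda.$$

The second moment can be computed by writing $k^2 = k(k-1) + k$:

$$\mathbb{E}[X^2] = \sum_{k=0}^{\infty} \big(k(k-1) + k\big) \frac{\lambda^k}{k!} e^{-\lambda} = \lambda^2 e^{-\lambda} \sum_{k=2}^{\infty} \frac{\lambda^{k-2}}{(k-2)!} + \lambda = \lambda^2 + \lambda.$$

The variance can be computed using $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mu^2 = \lambda^2 + \lambda - \lambda^2 = \lambda$.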


To compute the mean and variance of a Poisson random variable, we can call poisstat
in MATLAB and rv.stats(moments='mv') in Python.


% MATLAB code to compute Poisson statistics
lambda = 1;
[M,V] = poisstat(lambda);




# Python code to compute Poisson statistics
import scipy.stats as stats
lambd = 1
rv = stats.poisson(lambd)
M, V = rv.stats(moments='mv')


The Poisson random variable is special in the sense that the mean and the variance are 
equal. That is, if the mean arrival number is higher, the variance is also higher. This is very 
different from some other random variables, e.g., the normal random variable, whose mean
and variance are separate parameters that can be set independently. For certain engineering applications such as photography, this
plays an important role in defining the signal-to-noise ratio. We will come back to this point 
later. 
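
The equality of mean and variance can be observed empirically. The sketch below is a hedged illustration that avoids external libraries: it draws Poisson samples with Knuth's multiplication method (a standard sampler that is practical only for small $\lambda$, and not a method described in this book), and the seed, sample size, and rates are our own choices.

```python
import math
import random

random.seed(2)   # fixed seed so the run is repeatable

def poisson_sample(lam):
    """Knuth's multiplication method: count uniforms until their product drops below exp(-lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, random.random()
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k

trials = 20000
stats_by_rate = {}
for lam in (1, 4, 10):
    samples = [poisson_sample(lam) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((x - mean) ** 2 for x in samples) / trials
    stats_by_rate[lam] = (mean, var)
    print("lambda =", lam, " sample mean =", round(mean, 3), " sample var =", round(var, 3))
```

For each rate, the sample mean and the sample variance should both land close to $\lambda$, up to sampling error.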


Origin of the Poisson random variable 


We now address one of the most important questions about the Poisson random variable: 
Where does it come from? Answering this question is useful because the derivation process 
will reveal the underlying assumptions that lead to the Poisson PMF. When you change 
the problem setting, you will know when the Poisson PMF will hold and when the Poisson 
PMF will fail. 

Our approach to addressing this problem is to consider the photon arrival process. 
(As we have shown, there is conceptually no difference if you replace the photons with 
pedestrians, passengers, or phone calls.) Our derivation follows the argument of J. Goodman, 
Statistical Optics, Section 3.7.2. 

To begin with, we consider a photon arrival process. The total number of photons 
observed over an integration time t is defined as X(t). Because X(t) is a Poisson random 
variable, its arguments must be integers. The probability of observing X(t) = k is therefore 
P[X(t) = k]. Figure 3.37 illustrates the notations and concepts. 


Figure 3.37: Notations for deriving the Poisson PMF.


We propose three hypotheses with the photon arrival process:


• For sufficiently small $\Delta t$, the probability of a small impulse occurring in the time
interval $[t, t + \Delta t]$ is equal to the product of $\Delta t$ and the rate $\lambda$, i.e.,

$$P[X(t + \Delta t) - X(t) = 1] = \lambda \Delta t.$$

This is a linearity assumption, which typically holds for a short duration of time.

• For sufficiently small $\Delta t$, the probability that more than one impulse falls in $\Delta t$ is
negligible. Thus, we have that $P[X(t + \Delta t) - X(t) = 0] = 1 - \lambda \Delta t$.

• The numbers of impulses in non-overlapping time intervals are independent.


The significance of these three hypotheses is that if the underlying photon arrival process 
violates any of these assumptions, then the Poisson PMF will not hold. One example is the 
presence of scattering effects, where a photon has a certain probability of going off due to 
the scattering medium and a certain probability of coming back. In this case, the events will 
no longer be independent. 

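Assuming that these hypotheses hold, then at time $t + \Delta t$, the probability of observing
$X(t + \Delta t) = k$ can be computed as

$$P[X(t + \Delta t) = k] = P[X(t) = k] \cdot \underbrace{(1 - \lambda \Delta t)}_{= P[X(t+\Delta t) - X(t) = 0]} + \; P[X(t) = k-1] \cdot \underbrace{(\lambda \Delta t)}_{= P[X(t+\Delta t) - X(t) = 1]}$$
$$= P[X(t) = k] - P[X(t) = k] \lambda \Delta t + P[X(t) = k-1] \lambda \Delta t.$$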


By rearranging the terms we show that

$$\frac{P[X(t + \Delta t) = k] - P[X(t) = k]}{\Delta t} = \lambda \Big( P[X(t) = k-1] - P[X(t) = k] \Big).$$

Setting the limit of $\Delta t \to 0$, we arrive at an ordinary differential equation

$$\frac{d}{dt} P[X(t) = k] = \lambda \Big( P[X(t) = k-1] - P[X(t) = k] \Big). \qquad (3.19)$$


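We claim that the Poisson PMF, i.e.,

$$P[X(t) = k] = \frac{(\lambda t)^k}{k!} e^{-\lambda t},$$

would solve this differential equation. To see this, we substitute the PMF into the equation.
The left-hand side gives us

$$\frac{d}{dt} P[X(t) = k] = \frac{d}{dt} \left( \frac{(\lambda t)^k}{k!} e^{-\lambda t} \right)$$
$$= \lambda k \frac{(\lambda t)^{k-1}}{k!} e^{-\lambda t} + (-\lambda) \frac{(\lambda t)^k}{k!} e^{-\lambda t}$$
$$= \lambda \frac{(\lambda t)^{k-1}}{(k-1)!} e^{-\lambda t} - \lambda \frac{(\lambda t)^k}{k!} e^{-\lambda t}$$
$$= \lambda \Big( P[X(t) = k-1] - P[X(t) = k] \Big),$$

which is the right-hand side of the equation. To retrieve the basic form of Poisson, we can
just set t = 1 in the PMF so that

$$P[X(1) = k] = \frac{\lambda^k}{k!} e^{-\lambda}.$$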




The origin of Poisson random variables


• We assume independent arrivals.

• Probability of seeing one event is linear with the arrival rate.

• Time interval is short enough so that you see either one event or no event.

• Poisson is derived by solving a differential equation based on these assumptions.

• Poisson becomes invalid when these assumptions are violated, e.g., in the case
of scattering of photons due to a turbid medium.
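
The three hypotheses can be turned into a small simulation: chop the integration time into many short slots, let each slot independently contain one arrival with probability $\lambda \Delta t$, and count the arrivals. The sketch below is an illustration under our own assumptions (the rate, the number of slots, the number of trials, and all names are hypothetical choices); it compares the simulated counts with the Poisson PMF with mean $\lambda t$.

```python
import math
import random

random.seed(3)   # fixed seed so the run is repeatable

lam = 2.0        # arrival rate (illustrative assumption)
t = 1.0          # integration time
slots = 500      # number of short sub-intervals, so dt = t / slots
trials = 4000    # number of simulated integration periods
dt = t / slots

def count_arrivals():
    """Each short slot independently holds one arrival with probability lam*dt, else none."""
    return sum(1 for _ in range(slots) if random.random() < lam * dt)

counts = [count_arrivals() for _ in range(trials)]

def poisson_pmf(k, rate):
    return rate**k / math.factorial(k) * math.exp(-rate)

empirical = {k: counts.count(k) / trials for k in range(7)}
exact = {k: poisson_pmf(k, lam * t) for k in range(7)}
for k in range(7):
    print(k, round(empirical[k], 4), round(exact[k], 4))
```

The agreement is only approximate: there is sampling error from the finite number of trials, and a finite slot width means the count is really binomial rather than exactly Poisson.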


There is an alternative approach to deriving the Poisson PMF. The idea is to drive 
the parameter n in the binomial random variable to infinity while pushing p to zero. In this 
limit, the binomial PMF will converge to the Poisson PMF. We will discuss this shortly. 
However, we recommend the physics approach we have just described because it has a rich 
meaning and allows us to validate our assumptions. 


Poisson approximation to binomial 


We present one additional result about the Poisson random variable. The result shows that 
Poisson can be regarded as a limiting distribution of a binomial random variable. 


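Theorem 3.10. (Poisson approximation to binomial). For small p and large n,

$$\binom{n}{k} p^k (1-p)^{n-k} \approx \frac{\lambda^k}{k!} e^{-\lambda},$$

where $\lambda \overset{\text{def}}{=} np$.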


Before we prove the result, let us see how close the approximation can be. In Figure 3.38, 
we show a binomial distribution and a Poisson approximation. The closeness of the approx- 
imation can easily be seen. 

In MATLAB, the code to approximate a binomial distribution with a Poisson formula 
is shown below. Here, we draw 10,000 random binomial numbers and plot their histogram. 
On top of the plot, we use poisspdf to compute the Poisson PMF. This gives us Figure 3.38. 
A similar set of commands can be called in Python. 


% MATLAB code to approximate binomial using Poisson
n = 1000; p = 0.05;
X = binornd(n,p,[10000,1]);
t = 0:100;
[num,val] = hist(X,t);
lambda = n*p;
f_pois = poisspdf(t,lambda);
bar(num/10000, 'FaceColor',[0.9 0.9 0], 'BarWidth',1); hold on;
plot(f_pois, 'LineWidth', 4);

Figure 3.38: Poisson approximation of binomial distribution.


# Python code to approximate binomial using Poisson
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
n = 1000; p = 0.05
rv1 = stats.binom(n,p)
X = rv1.rvs(size=10000)
plt.figure(1); plt.hist(X,bins=np.arange(0,100));

rv2 = stats.poisson(n*p)
bins = np.arange(0,100)   # points at which to evaluate the Poisson PMF
f = rv2.pmf(bins)
plt.figure(2); plt.plot(f);


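Proof. Let $\lambda = np$, so that $p = \lambda/n$. Then

$$\binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!\,(n-k)!} \left( \frac{\lambda}{n} \right)^k \left( 1 - \frac{\lambda}{n} \right)^{n-k}$$
$$= \frac{\lambda^k}{k!} \cdot \frac{n(n-1)\cdots(n-k+1)}{n \cdot n \cdots n} \left( 1 - \frac{\lambda}{n} \right)^{-k} \left( 1 - \frac{\lambda}{n} \right)^{n}$$
$$= \frac{\lambda^k}{k!} \underbrace{(1)\left( 1 - \frac{1}{n} \right) \cdots \left( 1 - \frac{k-1}{n} \right) \left( 1 - \frac{\lambda}{n} \right)^{-k}}_{\to 1 \text{ as } n \to \infty} \left( 1 - \frac{\lambda}{n} \right)^{n}.$$

We claim that $\left(1 - \frac{\lambda}{n}\right)^n \to e^{-\lambda}$. This can be proved by noting that

$$\log(1 + x) \approx x, \qquad x \ll 1.$$

It then follows that $\log\left(1 - \frac{\lambda}{n}\right) \approx -\frac{\lambda}{n}$. Hence, $\left(1 - \frac{\lambda}{n}\right)^n \approx e^{-\lambda}$.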
Example 3.19. Consider an optical communication system. The bit arrival rate is $10^9$
bits/sec, and the probability of having one error bit is $10^{-9}$. Suppose we want to find
the probability of having five error bits in one second.


Let X be the number of error bits. In one second there are $10^9$ bits. Since we
do not know the location of these 5 bits, we have to enumerate all possibilities. This
leads to a binomial distribution. Using the binomial distribution, we know that the
probability of having k error bits is

$$P[X = k] = \binom{n}{k} p^k (1-p)^{n-k} = \binom{10^9}{k} (10^{-9})^k (1 - 10^{-9})^{10^9 - k}.$$

This quantity is difficult to calculate in floating-point arithmetic.


Using the Poisson to binomial approximation, we can see that the probability can
be approximated by

$$P[X = k] \approx \frac{\lambda^k}{k!} e^{-\lambda},$$

where $\lambda = np = 10^9 (10^{-9}) = 1$. Setting k = 5 yields $P[X = 5] \approx 0.003$.
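
To see how close the approximation in Example 3.19 is, one can evaluate the binomial probability in the log domain, which avoids forming the enormous binomial coefficient directly. The sketch below is illustrative and rests on assumptions: we take $n = 10^9$ and $p = 10^{-9}$ (values chosen so that $np = 1$, as the example states), and the helper name binomial_pmf_log is hypothetical.

```python
from math import exp, factorial, lgamma, log, log1p

def binomial_pmf_log(k, n, p):
    """Evaluate C(n,k) p^k (1-p)^(n-k) via logarithms to avoid overflow."""
    # log C(n,k) = log(n) + log(n-1) + ... + log(n-k+1) - log(k!)
    log_comb = sum(log(n - i) for i in range(k)) - lgamma(k + 1)
    return exp(log_comb + k * log(p) + (n - k) * log1p(-p))

n, p, k = 10**9, 1e-9, 5   # assumed values with n*p = 1
lam = n * p

binom_prob = binomial_pmf_log(k, n, p)
pois_prob = lam**k / factorial(k) * exp(-lam)

print("binomial (log domain):", binom_prob)
print("Poisson approximation:", pois_prob)
```

Both numbers round to about 0.003, the value quoted in the example.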


Photon arrival statistics 


Poisson random variables are useful in computer vision, but you may skip this discussion
if it is your first reading of the book.


The strong connection between Poisson statistics and physics makes the Poisson ran- 
dom variable a very good fit for many physical experiments. Here we demonstrate an appli- 
cation in modeling photon shot noise. 


An image sensor is a photon-sensitive device which is used to detect incoming photons.
In the simplest setting, we can model a pixel in the object plane as $X_{m,n}$, for some 2D
coordinate $[m,n] \in \mathbb{R}^2$. Written as an array, an $M \times N$ image in the object plane can be
visualized as

$$X = \text{object} = \begin{bmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,N} \\ \vdots & \vdots & \ddots & \vdots \\ X_{M,1} & X_{M,2} & \cdots & X_{M,N} \end{bmatrix}.$$


Without loss of generality, we assume that $X_{m,n}$ is normalized so that $0 \le X_{m,n} \le 1$ for
every coordinate $[m,n]$. To model the brightness, we multiply $X_{m,n}$ by a scalar $\alpha > 0$. If
a pixel $\alpha X_{m,n}$ has a large value, then it is a bright pixel; conversely, if $\alpha X_{m,n}$ has a small
value, then it is a dark pixel. At a particular pixel location $[m,n] \in \mathbb{R}^2$, the observed pixel
value $Y_{m,n}$ is a random variable following the Poisson statistics. This situation is illustrated
in Figure 3.39, where we see that an object-plane pixel will generate an observed pixel
through the Poisson PMF.¹


Figure 3.39: The image formation process is governed by the Poisson random variable. Given a pixel
in the object plane $X_{m,n}$, the observed pixel $Y_{m,n}$ is a Poisson random variable with mean $\alpha X_{m,n}$.
Therefore, a brighter pixel will have a higher Poisson mean, whereas a darker pixel will have a lower
Poisson mean.


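Written as an array, the image is

$$Y = \text{observed image} = \text{Poisson}\{\alpha X\} = \begin{bmatrix} \text{Poisson}\{\alpha X_{1,1}\} & \text{Poisson}\{\alpha X_{1,2}\} & \cdots & \text{Poisson}\{\alpha X_{1,N}\} \\ \text{Poisson}\{\alpha X_{2,1}\} & \text{Poisson}\{\alpha X_{2,2}\} & \cdots & \text{Poisson}\{\alpha X_{2,N}\} \\ \vdots & \vdots & \ddots & \vdots \\ \text{Poisson}\{\alpha X_{M,1}\} & \text{Poisson}\{\alpha X_{M,2}\} & \cdots & \text{Poisson}\{\alpha X_{M,N}\} \end{bmatrix}.$$

Here, by $\text{Poisson}\{\alpha X_{m,n}\}$ we mean that $Y_{m,n}$ is a random integer with probability mass

$$P[Y_{m,n} = k] = \frac{[\alpha X_{m,n}]^k}{k!} e^{-\alpha X_{m,n}}.$$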

Note that this model implies that the images seen by our cameras are more or less
an array of Poisson random variables. (We say "more or less" because of other sources of
uncertainties such as read noise, dark current, etc.) Because the observed pixels $Y_{m,n}$ are
random variables, they fluctuate about the mean values, and hence they are noisy. We refer
to this type of random fluctuation as the shot noise. The impact of the shot noise can be
seen in Figure 3.40. Here, we vary the sensor gain level $\alpha$. We see that for small $\alpha$ the image
is dark and has much random fluctuation. As $\alpha$ increases, the image becomes brighter and
the fluctuation becomes smaller.

In MATLAB, simulating the Poisson photon arrival process for an image requires the 
image-processing toolbox. The command to read an image is imread. Depending on the data 
type, the input array could be uint8 integers. To convert them to floating-point numbers
between 0 and 1, we use the command im2double. Drawing Poisson measurements from the 
clean image is done using poissrnd. Finally, we can use imshow to display the image. 


¹ The color of an image is often handled by a color filter array, which can be thought of as a wavelength
selector that allows a specific wavelength to pass through.

Figure 3.40: Illustration of the Poisson random variable in photographing images. Here, $\alpha$ denotes the
gain level of the sensor: Larger $\alpha$ means that there are more photons coming to the sensor. [Panels: $\alpha = 100$, $\alpha = 1000$.]


% MATLAB code to simulate a photon arrival process
x0 = im2double(imread('cameraman.tif'));
X = poissrnd(10*x0);
figure(1); imshow(x0, []);
figure(2); imshow(X, []);


Similar commands can be found in Python with the help of the cv2 library. When
reading an image, we call cv2.imread. The option 0 is used to read a gray-scale image;
otherwise, we will have a 3-channel color image. The division /255 ensures that the input
array ranges between 0 and 1. Generating the Poisson random numbers can be done using
np.random.poisson, or by calling the statistics library with stats.poisson.rvs(10*x0).
To display the images, we call plt.imshow, with the color map option set to cmap = 'gray'.


# Python code to simulate a photon arrival process
import numpy as np
import matplotlib.pyplot as plt
import cv2
x0 = cv2.imread('./cameraman.tif', 0)/255
plt.figure(1); plt.imshow(x0,cmap='gray');
X = np.random.poisson(10*x0)
plt.figure(2); plt.imshow(X, cmap='gray');


Why study Poisson? What is shot noise?


• The Poisson random variable is used to model photon arrivals.

• Shot noise is the random fluctuation of the photon counts at the pixels. Shot
noise is present even if you have an ideal sensor.


Signal-to-noise ratio of Poisson 


Now let us answer a question we asked before. A Poisson random variable has a variance
equal to the mean. Thus, if the scene is brighter, the variance will be larger. How come our
simulation in Figure 3.40 shows that the fluctuation becomes smaller as the scene becomes
brighter?

The answer to this question lies in the signal-to-noise ratio (SNR) of the Poisson 
random variable. The SNR of an image defines its quality. The higher the SNR, the better 
the image. The mathematical definition of SNR is the ratio between the signal power and 
the noise power. In our case, the SNR is 

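$$\text{SNR} = \frac{\text{signal power}}{\text{noise power}} \overset{\text{def}}{=} \frac{\mathbb{E}[Y]}{\sqrt{\mathrm{Var}[Y]}} \overset{(a)}{=} \frac{\lambda}{\sqrt{\lambda}} = \sqrt{\lambda},$$
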

where $Y = Y_{m,n}$ is one of the observed pixels and $\lambda = \alpha X_{m,n}$ is the corresponding object
pixel. In this equation, the step (a) uses the properties of the Poisson random variable Y
where $\mathbb{E}[Y] = \mathrm{Var}[Y] = \lambda$. The result $\text{SNR} = \sqrt{\lambda}$ is very informative. It says that if the
underlying mean photon flux (which is $\lambda$) increases, the SNR increases at a rate of $\sqrt{\lambda}$.
So, yes, the variance becomes larger when the scene is brighter. However, the gain in signal
$\mathbb{E}[Y]$ overrides the gain in noise $\sqrt{\mathrm{Var}[Y]}$. As a result, the big fluctuation in bright images
is compensated by the strong signal. Thus, to minimize the shot noise one has to use a
longer exposure to increase the mean photon flux. When the scene is dark and the aperture
is small, shot noise is unavoidable.
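
The $\sqrt{\lambda}$ behavior of the SNR can be confirmed numerically from the Poisson PMF itself. The sketch below is an illustration under our own assumptions: the function name poisson_snr, the truncation rule for the infinite sum, and the list of rates are hypothetical choices, and the PMF is evaluated in the log domain so that large $\lambda$ does not overflow.

```python
from math import exp, lgamma, log, sqrt

def poisson_snr(lam):
    """Compute E[Y] / sqrt(Var[Y]) for Y ~ Poisson(lam) by summing a truncated PMF."""
    kmax = int(lam + 20 * sqrt(lam) + 20)   # far enough into the tail to be negligible
    pmf = [exp(k * log(lam) - lam - lgamma(k + 1)) for k in range(kmax + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
    return mean / sqrt(var)

for lam in (1, 10, 100, 1000):
    print("lambda =", lam, " SNR =", round(poisson_snr(lam), 4), " sqrt(lambda) =", round(sqrt(lam), 4))
```

Increasing the mean photon flux by a factor of 100 improves the SNR by a factor of 10, which is why brighter scenes look cleaner even though their variance is larger.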

Poisson modeling is useful for describing the problem. However, the actual engineering
question is that, given a noisy observation $Y_{m,n}$, how would you reconstruct the clean image
$X_{m,n}$? This is a very difficult inverse problem. The typical strategy is to exploit the spatial
correlations between nearby pixels, e.g., images are usually smooth except along some sharp edges.
Other information about the image, e.g., the likelihood of obtaining texture patterns, can
also be leveraged. Modern image-processing methods are rich, ranging from classical filtering 
techniques to deep neural networks. Static images are easier to recover because we can often 
leverage multiple measurements of the same scene to boost the SNR. Dynamic scenes are 
substantially harder when we need to track the motion of any underlying objects. There are 
also newer image sensors with better photon sensitivity. The problem of imaging in the dark 
is an important research topic in computational imaging. New solutions are developed at 
the intersection of optics, signal processing, and machine learning. 


The end of our discussions on photon statistics. 
3.6 Summary


A random variable is so called because it can take more than one state. The probability mass 
function specifies the probability for it to land on a particular state. Therefore, whenever 
you think of a random variable you should immediately think of its PMF (or histogram 
if you prefer). The PMF is a unique characterization of a random variable. Two random 
variables with the same PMF are effectively the same random variables. (They are not 
identical because there could be measure-zero sets where the two differ.) Once you have the 
PMF, you can derive the CDF, expectation, moments, variance, and so on. 

When your boss hands a dataset to you, which random variable (which model) should 
you use? This is a very practical and deep question. We highlight three steps for you to 
consider: 


• (i) Model selection: Which random variable is the best fit for our problem?
Sometimes we know by physics that, for example, photon arrivals or internet traffic follow a
Poisson random variable. However, not all datasets can be easily described by simple 
models. The models we have learned in this chapter are called the parametric mod- 
els because they are characterized by one or two parameters. Some datasets require 
nonparametric models, e.g., natural images, because they are just too complex. Some 
data scientists refer to deep neural networks as parametric models because the net- 
work weights are essentially the parameters. Some do not because when the number 
of parameters is on the order of millions, sometimes even more than the number of 
training samples, it seems more reasonable to call these models nonparametric. How- 
ever, putting this debate aside, shortlisting a few candidate models based on prior 
knowledge is essential. Even if you use deep neural networks, selecting between con- 
volutional structures versus long short-term memory models is still a legitimate task 
that requires an understanding of your problem. 


• (ii) Parameter estimation: Suppose that you now have a candidate model; the next
task is to estimate the model parameter using the available training data. For example, 
for Poisson we need to determine \, and for binomial we need to determine (n,p). The 
estimation problem is an inverse problem. Often we need to use the PMF to construct 
certain optimization problems. By solving the optimization problem we will find the 
best parameter (for that particular candidate model). Modern machine learning is 
doing significantly better now than in the old days because optimization methods 
have advanced greatly. 


• (iii) Validation: When each candidate model has been optimized to best fit the data,
we still need to select the best model. This is done by running various tests. For
example, we can construct a validation set and check which model gives us the best 
performance (such as classification rate or regression error). However, a model with 
the best validation score is not necessarily the best model. Your goal should be to seek 
a good model and not the best model because determining the best requires access to 
the testing data, which we do not have. Everything being equal, the common wisdom 
is to go with a simpler model because it is generally less susceptible to overfitting. 
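
The three steps can be sketched on a toy problem. The snippet below is only an illustration under stated assumptions: the "dataset" is synthetic (Poisson counts with rate 4, drawn with Knuth's multiplication method), the two shortlisted candidates are the Poisson and geometric models from this chapter, the parameters are fitted with their closed-form maximum-likelihood estimates, and the validation score is the held-out log-likelihood. The names `sample_poisson`, `poisson_loglik`, and `geometric_loglik` are hypothetical helpers, not part of any library.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def poisson_loglik(data, lam):
    """Log-likelihood of counts under the Poisson PMF lam^k e^{-lam} / k!."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in data)

def geometric_loglik(data, p):
    """Log-likelihood of counts under the geometric PMF p (1 - p)^k, k = 0, 1, ..."""
    return sum(math.log(p) + k * math.log(1.0 - p) for k in data)

rng = random.Random(0)                                   # fixed seed for reproducibility
data = [sample_poisson(rng, 4.0) for _ in range(2000)]   # synthetic dataset (assumption)
train, valid = data[:1500], data[1500:]

# (i) Model selection: shortlist two candidate parametric models.
# (ii) Parameter estimation: closed-form maximum-likelihood estimates.
sample_mean = sum(train) / len(train)
lam_hat = sample_mean                 # Poisson MLE is the sample mean
p_hat = 1.0 / (1.0 + sample_mean)     # geometric MLE on {0, 1, ...}

# (iii) Validation: score each fitted model on held-out data.
scores = {
    "poisson": poisson_loglik(valid, lam_hat),
    "geometric": geometric_loglik(valid, p_hat),
}
best = max(scores, key=scores.get)
print(f"lambda_hat = {lam_hat:.3f}, p_hat = {p_hat:.3f}, selected model: {best}")
```

Held-out log-likelihood is just one possible validation score; classification rate or regression error would play the same role in other problems.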




3.7 References 


Probability textbooks 


3-1 Dimitri P. Bertsekas and John N. Tsitsiklis, Introduction to Probability, Athena Sci- 
entific, 2nd Edition, 2008. Chapter 2. 


3-2 Alberto Leon-Garcia, Probability, Statistics, and Random Processes for Electrical
Engineering, Prentice Hall, 3rd Edition, 2008. Chapter 3.


3-3 Athanasios Papoulis and S. Unnikrishna Pillai, Probability, Random Variables and 
Stochastic Processes, McGraw-Hill, 4th Edition, 2001. Chapters 3 and 4. 


3-4 John A. Gubner, Probability and Random Processes for Electrical and Computer
Engineers, Cambridge University Press, 2006. Chapters 2 and 3.


3-5 Sheldon Ross, A First Course in Probability, Prentice Hall, 8th Edition, 2010. Chap- 
ter 4. 


3-6 Henry Stark and John Woods, Probability and Random Processes With Applications 
to Signal Processing, Prentice Hall, 3rd Edition, 2001. Chapters 2 and 4. 


Advanced probability textbooks 


3-7 William Feller, An Introduction to Probability Theory and Its Applications, Wiley and 
Sons, 3rd Edition, 1950. 


3-8 Andrey Kolmogorov, Foundations of the Theory of Probability, 2nd English Edition, 
Dover 2018. (Translated from Russian to English. Originally published in 1950 by 
Chelsea Publishing Company New York.) 


Cross-validation 
3-9 Larry Wasserman, All of Statistics, Springer 2004. Chapter 20. 


3-10 Mats Rudemo, “Empirical Choice of Histograms and Kernel Density Estimators,” 
Scandinavian Journal of Statistics, Vol. 9, No. 2 (1982), pp. 65-78. 


3-11 David W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, 
Wiley, 1992. 


Poisson statistics 


3-12 Joseph Goodman, Statistical Optics, Wiley, 2015. Chapter 3. 


3-13 Henry Stark and John Woods, Probability and Random Processes With Applications 
to Signal Processing, Prentice Hall, 3rd edition, 2001. Section 1.10. 




3.8 Problems 


Exercise 1. (VIDEO SOLUTION) 

Consider an information source that produces numbers $k$ in the set $S_X = \{1, 2, 3, 4\}$. Find
and plot the PMF in the following cases:

(a) $p_k = p_1/k$, for $k = 1, 2, 3, 4$. Hint: Find $p_1$.
(b) $p_{k+1} = p_k/2$ for $k = 1, 2, 3$.
(c) $p_{k+1} = p_k/2^k$ for $k = 1, 2, 3$.
(d) Can the random variables in parts (a)-(c) be extended to take on values in the set
$\{1, 2, \ldots\}$? Why or why not? Hint: You may use the fact that the series $1 + \frac{1}{2} + \frac{1}{3} + \cdots$
diverges.


Exercise 2. (VIDEO SOLUTION) 
Two dice are tossed. Let X be the absolute difference in the number of dots facing up. 


(a) Find and plot the PMF of X. 
(b) Find the probability that X < 2. 
(c) Find E[X] and Var[X]. 


Exercise 3. (VIDEO SOLUTION) 
Let X be a random variable with PMF $p_k = c/2^k$ for $k = 1, 2, \ldots$.


(a) Determine the value of c. 
(b) Find P(X > 4) and P(6 < X <8). 
(c) Find E[X] and Var[X]. 


Exercise 4.
Let X be a random variable with PMF $p_k = c/2^k$ for $k = -1, 0, 1, 2, 3, 4, 5$.

(a) Determine the value of c.
(b) Find $P(1 \le X < 3)$ and $P(1 < X < 5)$.
(c) Find $P[X < 5]$.
(d) Find the PMF and the CDF of X.


Exercise 5. (VIDEO SOLUTION)

A modem transmits a +2 voltage signal into a channel. The channel adds to this sig- 
nal a noise term that is drawn from the set {0,—1,—2,—3} with respective probabilities 
{4/10, 3/10, 2/10, 1/10}. 


(a) Find the PMF of the output Y of the channel.
(b) What is the probability that the channel’s output is equal to the input of the channel?
(c) What is the probability that the channel’s output is positive?
(d) Find the expected value and variance of Y.


Exercise 6. 

On a given day, your golf score takes values from numbers 1 through 10, with equal proba- 
bility of getting each one. Assume that you play golf for three days, and assume that your 
three performances are independent. Let $X_1$, $X_2$, and $X_3$ be the scores that you get, and
let X be the minimum of these three numbers.


(a) Show that for any discrete random variable X, $p_X(k) = P(X > k-1) - P(X > k)$.
(b) What is the probability $P(X_1 > k)$ for $k = 1, \ldots, 10$?
(c) Using (a), determine the PMF $p_X(k)$, for $k = 1, \ldots, 10$.
(d) What is the average score improvement if you play just for one day compared with
playing for three days and taking the minimum?


Exercise 7. (VIDEO SOLUTION) 
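Let

$$g(X) = \begin{cases} 1, & \text{if } X > 10, \\ 0, & \text{otherwise,} \end{cases} \qquad h(X) = \begin{cases} X - 10, & \text{if } X - 10 \ge 0, \\ 0, & \text{otherwise.} \end{cases}$$

(a) Find $E[g(X)]$ for X as in Problem 1(a) with $S_X = \{1, \ldots, 15\}$.
(b) Find $E[h(X)]$ for X as in Problem 1(b) with $S_X = \{1, \ldots, 15\}$.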


Exercise 8. (VIDEO SOLUTION) 
A voltage X is uniformly distributed in the set $\{-3, \ldots, 3, 4\}$.

(a) Find the mean and variance of X.
(b) Find the mean and variance of $Y = -2X^2 + 3$.
(c) Find the mean and variance of $W = \cos(\pi X/8)$.
(d) Find the mean and variance of $Z = \cos^2(\pi X/8)$.


Exercise 9. (VIDEO SOLUTION) 
(a) If X is Poisson($\lambda$), compute $E[1/(X + 1)]$.
(b) If X is Bernoulli($p$) and Y is Bernoulli($q$), compute $E[(X + Y)^3]$ if X and Y are
independent.

(c) Let X be a random variable with mean $\mu$ and variance $\sigma^2$. Let $\Delta(\theta) = E[(X - \theta)^2]$.
Find $\theta$ that minimizes the error $\Delta(\theta)$.

(d) Suppose that $X_1, \ldots, X_n$ are independent uniform random variables in $\{0, 1, \ldots, 100\}$.
Evaluate $P[\min(X_1, \ldots, X_n) > \ell]$ for any $\ell \in \{0, 1, \ldots, 100\}$.


Exercise 10. (VIDEO SOLUTION) 


(a) Consider the binomial probability mass function $p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}$. Show that
the mean is $E[X] = np$.

(b) Consider the geometric probability mass function $p_X(k) = p(1-p)^k$ for $k = 0, 1, \ldots$.
Show that the mean is $E[X] = (1-p)/p$.

(c) Consider the Poisson probability mass function $p_X(k) = \frac{\lambda^k}{k!} e^{-\lambda}$. Show that the variance
is $\mathrm{Var}[X] = \lambda$.

(d) Consider the uniform probability mass function $p_X(k) = \frac{1}{L}$ for $k = 1, \ldots, L$. Show that
the variance is $\mathrm{Var}[X] = \frac{L^2 - 1}{12}$. Hint: $1 + 2 + \cdots + n = \frac{n(n+1)}{2}$ and
$1^2 + 2^2 + \cdots + n^2 = \frac{n^3}{3} + \frac{n^2}{2} + \frac{n}{6}$.


Exercise 11. (VIDEO SOLUTION) 

An audio player uses a low-quality hard drive. The probability that the hard drive fails after 
being used for one month is 1/12. If it fails, the manufacturer offers a free-of-charge repair 
for the customer. For the cost of each repair, however, the manufacturer has to pay $20. 
The initial cost of building the player is $50, and the manufacturer offers a 1-year warranty. 
Within one year, the customer can ask for a free repair up to 12 times. 


(a) Let X be the number of months when the player fails. What is the PMF of X? Hint: 
P[X = 1] may not be very high because if the hard drive fails it will be fixed by the 
manufacturer. Once fixed, the drive can fail again in the remaining months. So saying 
X = 1 is equivalent to saying that there is only one failure in the entire 12-month 
period. 


(b) What is the average cost per player? 


Exercise 12. (VIDEO SOLUTION) 

A binary communication channel has a probability of bit error of $p = 10^{-6}$. Suppose that
transmission occurs in blocks of 10,000 bits. Let N be the number of errors introduced by 
the channel in a transmission block. 


(a) What is the PMF of N? 
(b) Find P[N = 0] and P[N < 3]. 


(c) For what value of p will the probability of 1 or more errors in a block be 99%?

Hint: Use the Poisson approximation to binomial random variables.


Exercise 13. (VIDEO SOLUTION) 

The number of orders waiting to be processed is given by a Poisson random variable with
parameter $\alpha = \lambda/(n\mu)$, where $\lambda$ is the average number of orders that arrive in a day, $\mu$ is the
number of orders that an employee can process per day, and $n$ is the number of employees.
Let $\lambda = 5$ and $\mu = 1$. Find the number of employees required so the probability that more
than four orders are waiting is less than 10%.


Hint: You need to use trial and error for a few n’s. 


Exercise 14. 

Let X be the number of photons counted by a receiver in an optical communication system. 
It is known that X is a Poisson random variable with a rate $\lambda_1$ when a signal is present and a
Poisson random variable with the rate $\lambda_0 < \lambda_1$ when a signal is absent. The probability that
the signal is present is p. Suppose that we observe X = k photons. We want to determine a
threshold T such that if k > T we claim that the signal is present, and if k < T we claim
that the signal is absent. What is the value of T?


Continuous Random Variables


If you are coming to this chapter from Chapter 3, we invite you to take a 30-second pause 
and switch your mind from discrete events to continuous events. Everything is continuous 
now. The sample space is continuous, the event space is continuous, and the probability 
measure is continuous. Continuous random variables are similar in many ways to discrete 
random variables. They are characterized by the probability density functions (the continu- 
ous version of the probability mass functions); they have cumulative distribution functions; 
they have means, moments, and variances. The most significant difference is perhaps the use 
of integration instead of summation, but this change is conceptually straightforward, aside 
from the difficulties associated with integrating functions. So why do we need a separate 
chapter for continuous random variables? There are several reasons. 


• First, how would you define the probability of a continuous event? Note that we cannot
count because a continuous event is uncountable. There is also nothing called the 
probability mass because there are infinitely many masses. To define the probability 
of continuous events, we need to go back to our “slogan”: probability is a measure 
of the size of a set. Because probability is a measure, we can speak meaningfully 
about the probability of continuous events so long as we have a well-defined measure 
for them. Defining such a measure requires some effort. We will develop the intuitions 
and the formal definitions in Section 4.1. In Section 4.2, we will discuss the expectation 
and variance of continuous random variables. 


• The second challenge is the unification between continuous and discrete random variables.
Since the two types of random variables ultimately measure the size of a set, it
is natural to ask whether we can unify them. Our approach to unifying them is based 
on the cumulative distribution functions (CDFs), which are well-defined functions for 
discrete and continuous random variables. Based on the CDF and the fundamental 
theorem of calculus, we can show that the probability density functions and proba- 
bility mass functions can be derived from the derivative of the CDFs. These will be 
discussed in Section 4.3, and in Section 4.4 we will discuss some additional results 
about the mode and median. 


• The third challenge is to understand several widely used continuous random variables.
We will discuss the uniform random variable and the exponential random variable 
in Section 4.5. Section 4.6 deals with the important topic of the Gaussian random 
variable. Where does a Gaussian random variable come from? Why does it have a bell
shape? Why are Gaussian random variables so popular in data science? What are the
useful properties of Gaussian random variables? What are the relationships between 
a Gaussian random variable and other random variables? These important questions 
will be answered in Section 4.6. 


• The final challenge is the transformation of random variables. Imagine that you have a
random variable X and a function g. What will the probability mass/density function 
of g(X) be? Addressing this problem is essential because almost all practical engineer- 
ing problems involve the transformation of random variables. For example, suppose 
we have voltage measurements and we would like to compute the power. This requires 
taking the square of the voltage. We will discuss the transformation in Section 4.7, 
and we will also discuss an essential application in generating random numbers in 
Section 4.8. 


4.1 Probability Density Function 


4.1.1 Some intuitions about probability density functions 


Let’s begin by outlining some intuitive reasoning, which is needed to define the probability 
of continuous events properly. These intuitions are based on the fact that probability is a 
measure. In the following discussion you will see a sequence of logical arguments for con- 
structing such a measure for continuous events. Some arguments are discussed in Chapter 2, 
but now we place them in the context of continuous random variables. 

Suppose we are given an event A that is a subset in the sample space $\Omega$, as illustrated
in Figure 4.1. In order to calculate the probability of A, the measure perspective suggests
that we consider the relative size of the set

$$P[\{x \in A\}] = \frac{\text{“size” of } A}{\text{“size” of } \Omega}.$$


The right-hand side of this equation captures everything about the probability: It is a 
measure of the size of a set. It is relative to the sample space. It is a number between 0 and 
1. It can be applied to discrete sets, and it can be applied to continuous sets. 

How do we measure the “size” of a continuous set? One possible way is by means of 
integrating the length, area, or volume covered by the set. Consider an example: Suppose 
that the sample space is the interval $\Omega = [0, 5]$ and the event is $A = [2, 3]$. To measure the
“size” of A, we can integrate A to determine the length. That is,

$$P[\{x \in [2, 3]\}] = \frac{\text{“size” of } A}{\text{“size” of } \Omega} = \frac{\int_A dx}{\int_\Omega dx} = \frac{\int_2^3 dx}{\int_0^5 dx} = \frac{1}{5}.$$


Figure 4.1: [Left] An event A in the sample space $\Omega$. The probability that A happens can be calculated
as the “size” of A relative to the “size” of $\Omega$. [Right] A specific example on the real line. Note that the
same definition of probability applies: The probability is the size of the interval A relative to that of the
sample space $\Omega$.

Therefore, we have translated the “size” of a set to an integration. However, this definition
is a very special case because when we calculate the “size” of a set, we treat all the elements
in the set with equal importance. This is a strong assumption that will be relaxed later. But


if you agree with this line of reasoning, we can rewrite the probability as

$$P[\{x \in A\}] = \frac{\int_A dx}{\int_\Omega dx} = \frac{\int_A dx}{|\Omega|} = \int_A \underbrace{\frac{1}{|\Omega|}}_{\text{equally important over } \Omega} dx.$$


This equation says that under our assumption (that all elements are equiprobable), the
probability of A is calculated as the integration of A using an integrand $1/|\Omega|$ (note that
$1/|\Omega|$ is a constant with respect to $x$). If we evaluate the probability of another event B, all
we need to do is to replace A with B and compute $\int_B \frac{1}{|\Omega|}\, dx$.
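
As a small sketch of the equiprobable case, the function below computes the length of an interval event relative to the length of an interval sample space. The helper name `uniform_probability` and the second event B = [0, 2.5] are assumptions made for illustration; the first call reuses the interval example with sample space [0, 5] and A = [2, 3].

```python
def uniform_probability(event, omega):
    """P[x in event] when all points of omega are equally important.

    Both arguments are (low, high) intervals; the event is clipped to omega,
    and the result is the clipped length divided by the length of omega.
    """
    low = max(event[0], omega[0])
    high = min(event[1], omega[1])
    return max(high - low, 0.0) / (omega[1] - omega[0])

omega = (0.0, 5.0)
prob_a = uniform_probability((2.0, 3.0), omega)   # event A = [2, 3]
prob_b = uniform_probability((0.0, 2.5), omega)   # hypothetical event B = [0, 2.5]
print(prob_a, prob_b)
```

Swapping A for B changes only the integration region; the integrand stays the same constant.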

What happens if we want to relax the “equiprobable” assumption? Perhaps we can
adopt something similar to the probability mass function (PMF). Recall that a PMF $p_X$
evaluated at a point $x$ is the probability that the state $x$ happens, i.e., $p_X(x) = P[X = x]$.
So, $p_X(x)$ is the relative frequency of $x$. Following the same line of thinking, we can define a
function $f_X$ such that $f_X(x)$ tells us something related to the “relative frequency”. To this
end, we can treat $f_X$ as a continuous histogram with infinitesimal bin width as shown in
Figure 4.2. Using this $f_X$, we can replace the constant function $1/|\Omega|$ with the new function
$f_X(x)$. This will give us

$$P[\{x \in A\}] = \int_A \underbrace{f_X(x)}_{\text{replace } 1/|\Omega|} dx. \tag{4.1}$$


If we compare it with a PMF, we note that when X is discrete,

$$P[\{x \in A\}] = \sum_{x \in A} p_X(x).$$


Hence, fx can be considered a continuous version of px, although we do not recommend 
this way of thinking for the following reason: px (x) is a legitimate probability, but fx (x) is 
not a probability. Rather, fx is the probability per unit length, meaning that we need to 
integrate fx (times dx) in order to generate a probability value. If we only look at fx at 
a point x, then this point is a measure-zero set because the length of this set is zero. 

Equation (4.1) should be familiar to you from Chapter 2. The function $f_X(x)$ is precisely
the weighting function we described in that chapter.

Figure 4.2: [Left] A probability mass function (PMF) tells us the relative frequency of a state when
computing the probability. In this example, the “size” of A is $p_X(x_2) + p_X(x_3)$. [Right] A probability
density function (PDF) is the infinitesimal version of the PMF. Thus, the “size” of A is the integration
over the PDF.


What is a PDF? 
• A PDF is the continuous version of a PMF.
• We integrate a PDF to compute the probability.
• We integrate instead of sum because continuous events are not countable.


To summarize, we have learned that when measuring the size of a continuous event, 
the discrete technique (counting the number of elements) does not work. Generalizing to 
continuous space requires us to integrate the event. However, since different elements in an 
event have different relative emphases, we use the probability density function $f_X(x)$ to tell
us the relative frequency for a state x to happen. This PDF serves the role of the PMF. 


4.1.2 More in-depth discussion about PDFs 


A continuous random variable X is defined by its probability density function fx. This 
function has to satisfy several criteria, summarized as follows. 


Definition 4.1. A probability density function $f_X$ of a random variable X is a mapping
$f_X : \Omega \to \mathbb{R}$, with the properties

• Non-negativity: $f_X(x) \ge 0$ for all $x \in \Omega$
• Unity: $\int_\Omega f_X(x)\, dx = 1$
• Measure of a set: $P[\{x \in A\}] = \int_A f_X(x)\, dx$


If all elements of the sample space are equiprobable, then the PDF is $f_X(x) = 1/|\Omega|$. You can
easily check that it satisfies all three criteria.
Let us take a closer look at the three criteria: 


• Non-negativity: The non-negativity criterion $f_X(x) \ge 0$ is reminiscent of Probability
Axiom I. It says that no matter what $x$ we are looking at, the probability density
function $f_X$ evaluated at $x$ should never give a negative value. Axiom I ensures that
we will not get a negative probability.

• Unity: The unity criterion $\int_\Omega f_X(x)\, dx = 1$ is reminiscent of Probability Axiom II,
which says that measuring over the entire sample space will give 1.


• Measure of a set: The third criterion gives us a way to measure the size of an event A.
It says that since each $x \in \Omega$ has a different emphasis when calculating the size of
A, we need to scale the elements properly. This scaling is done by the PDF $f_X(x)$,
which can be regarded as a histogram with a continuous x-axis. The third criterion
is a consequence of Probability Axiom III, because if there are two events A and B
that are disjoint, then $P[\{x \in A\} \cup \{x \in B\}] = \int_A f_X(x)\, dx + \int_B f_X(x)\, dx$ because
$f_X(x) \ge 0$ for all $x$.


If the random variable X takes real numbers in 1D, then a more “user-friendly” definition 
of the PDF can be given. 


Definition 4.2. Let X be a continuous random variable. The probability density
function (PDF) of X is a function $f_X : \Omega \to \mathbb{R}$ that, when integrated over an interval
$[a, b]$, yields the probability of obtaining $a \le X \le b$:

$$P[a \le X \le b] = \int_a^b f_X(x)\, dx. \tag{4.2}$$


This definition is just a rewriting of the previous definition by explicitly writing out 
the definition of A as an interval [a,b]. Here are a few examples. 


Example 4.1. Let $f_X(x) = 3x^2$ with $\Omega = [0, 1]$. Let $A = [0, 0.5]$. Then the probability
$P[\{X \in A\}]$ is

$$P[0 \le X \le 0.5] = \int_0^{0.5} 3x^2\, dx = \frac{1}{8}.$$


Example 4.2. Let $f_X(x) = 1/|\Omega|$ with $\Omega = [0, 5]$. Let $A = [3, 5]$. Then the probability
$P[\{X \in A\}]$ is

$$P[3 \le X \le 5] = \int_3^5 \frac{1}{|\Omega|}\, dx = \int_3^5 \frac{1}{5}\, dx = \frac{2}{5}.$$


Example 4.3. Let $f_X(x) = 2x$ with $\Omega = [0, 1]$. Let $A = \{0.5\}$. Then the probability
$P[\{X \in A\}]$ is

$$P[X = 0.5] = P[0.5 \le X \le 0.5] = \int_{0.5}^{0.5} 2x\, dx = 0.$$

This example shows that evaluating the probability at an isolated point for a continuous
random variable will yield 0.


Practice Exercise 4.1. Let X be the phase angle of a voltage signal. Without any
prior knowledge about X we may assume that X has an equal probability of any value
between 0 and $2\pi$. Find the PDF of X and compute $P[0 \le X \le \pi/2]$.


Solution. Since X has an equal probability for any value between 0 and $2\pi$, the PDF
of X is

$$f_X(x) = \frac{1}{2\pi}, \quad \text{for } 0 \le x \le 2\pi.$$

Therefore, the probability $P[0 \le X \le \pi/2]$ can be computed as

$$P\left[0 \le X \le \frac{\pi}{2}\right] = \int_0^{\pi/2} \frac{1}{2\pi}\, dx = \frac{1}{4}.$$
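
The probabilities in Examples 4.1 to 4.3 and Practice Exercise 4.1 can be checked numerically. The sketch below uses a plain midpoint rule; `integrate` is a hypothetical helper written for this illustration (not a library routine), and the number of subintervals is an arbitrary choice.

```python
import math

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

p_ex41 = integrate(lambda x: 3 * x**2, 0.0, 0.5)                      # close to 1/8
p_ex42 = integrate(lambda x: 1 / 5, 3.0, 5.0)                         # close to 2/5
p_ex43 = integrate(lambda x: 2 * x, 0.5, 0.5)                         # zero-length interval
p_pe41 = integrate(lambda x: 1 / (2 * math.pi), 0.0, math.pi / 2)     # close to 1/4

print(p_ex41, p_ex42, p_ex43, p_pe41)
```

The third value is exactly zero because the interval has zero length, which mirrors the measure-zero argument for an isolated point.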


Looking at Equation (4.2), you may wonder: If the PDF $f_X$ is analogous to PMF
$p_X$, why didn’t we require $0 \le f_X(x) \le 1$ instead of requiring only $f_X(x) \ge 0$? This is
an excellent question, and it points exactly to the difference between a PMF and a PDF.
Notice that $f_X$ is a mapping from the sample space $\Omega$ to the real line $\mathbb{R}$. It does not map
$\Omega$ to [0,1]. On the other hand, since $p_X(x)$ is the actual probability, it maps $\Omega$ to [0,1].
Thus, $f_X(x)$ can take very large values but will not explode, because we have the unity
constraint $\int_\Omega f_X(x)\, dx = 1$. Even if $f_X(x)$ takes a large value, it will be compensated by the
small $dx$. If you recall, there is nothing like $dx$ in the definition of a PMF. Whenever there
is a probability mass, we need to sum or, putting it another way, the $dx$ in the discrete case
is always 1. Therefore, while the probability mass PMF must not exceed 1, a probability
density PDF can exceed 1.
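
A quick numerical illustration of a density that exceeds 1 everywhere on its support: take a uniform density on the interval [0, 0.25], which is a hypothetical choice made only for this sketch. Its height is 4, yet it still integrates to 1. The `integrate` helper is a simple midpoint rule, not a library function.

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

width = 0.25   # hypothetical support [0, width]

def pdf(x):
    """Uniform density on [0, width]; the height is 1/width = 4."""
    return 1.0 / width if 0.0 <= x <= width else 0.0

height = pdf(0.1)                     # the density value, larger than 1
total = integrate(pdf, 0.0, width)    # unity constraint
prob = integrate(pdf, 0.0, 0.1)       # P[0 <= X <= 0.1]
print(height, total, prob)
```

The large height is compensated by the short interval, so every probability computed from this density stays within [0, 1].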


If $f_X(x) > 1$, then what is the meaning of $f_X(x)$? Isn’t it representing the probability
of having an element $X = x$? If it were a discrete random variable, then yes; $p_X(x)$ is the
probability of having $X = x$ (so the probability mass cannot go beyond 1). However, for a
continuous random variable, $f_X(x)$ is not the probability of having $X = x$. The probability
of having $X = x$ (i.e., exactly at $x$) is 0 because an isolated point has zero measure in the
continuous space. Thus, even though $f_X(x)$ takes a value larger than 1, the probability of
X being $x$ is zero.


At this point you can see why we call PDF a density, or density function, because each
value $f_X(x)$ is the probability per unit length. If we want to calculate the probability of
$x \le X \le x + \delta$, for example, then according to our definition, we have

$$P[x \le X \le x + \delta] = \int_x^{x+\delta} f_X(x')\, dx' \approx f_X(x) \cdot \delta.$$

Therefore, the probability $P[x \le X \le x + \delta]$ can be regarded as the “per unit length”
density $f_X(x)$ multiplied with the “length” $\delta$. As $\delta \to 0$, we can see that $P[X = x] = 0$. See
Figure 4.3 for an illustration.
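
The approximation can be examined on a concrete density. The sketch below assumes the density 2x on [0, 1] (the same form as in Example 4.3), for which the exact interval probability is available in closed form as (x + δ)² − x². The function names are hypothetical and chosen only for this illustration.

```python
def pdf(x):
    """Density 2x on [0, 1], and 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def exact_probability(x, delta):
    """Exact P[x <= X <= x + delta] for this density, valid when x and x + delta lie in [0, 1]."""
    return (x + delta) ** 2 - x ** 2

x = 0.5
for delta in (0.1, 0.01, 0.001):
    approx = pdf(x) * delta                 # density times length
    exact = exact_probability(x, delta)
    print(f"delta={delta}: exact={exact:.6f}, approx={approx:.6f}, gap={exact - approx:.2e}")
```

For this density the gap equals δ², so it shrinks faster than the probability itself, and both go to zero as δ goes to zero.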


Why are PDFs called a density function? 


• Because $f_X(x)$ is the probability per unit length.

• You need to integrate $f_X(x)$ to obtain a probability.


Figure 4.3: The probability $P[x \le X \le x + \delta]$ can be approximated by the density $f_X(x)$ multiplied by
the length $\delta$.


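Example 4.4. Consider a random variable X with PDF $f_X(x) = \frac{1}{2\sqrt{x}}$ for any
$0 < x \le 1$, and is 0 otherwise. We can show that $f_X(x) \to \infty$ as $x \to 0$. However,
$f_X(x)$ remains a valid PDF because

$$\int_{-\infty}^{\infty} f_X(x)\, dx = \int_0^1 \frac{1}{2\sqrt{x}}\, dx = \sqrt{x}\,\Big|_0^1 = 1.$$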


Remark. Since isolated points have zero measure in the continuous space, the probability 
of an open interval (a,b) is the same as the probability of a closed interval: 


$$P[[a, b]] = P[(a, b)] = P[(a, b]] = P[[a, b)].$$


The exception is when the PDF $f_X(x)$ has a delta function at $a$ or $b$. In this case,
the probability measure at $a$ or $b$ will be non-zero. We will discuss this when we talk about
the CDFs.


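Practice Exercise 4.2. Let $f_X(x) = c(1 - x^2)$ for $-1 \le x \le 1$, and 0 otherwise. Find
the constant $c$.


Solution. Since $\int_\Omega f_X(x)\, dx = 1$, it follows that

$$\int_\Omega f_X(x)\, dx = \int_{-1}^{1} c(1 - x^2)\, dx = \frac{4c}{3} \quad \Longrightarrow \quad c = \frac{3}{4}.$$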


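Practice Exercise 4.3. Let $f_X(x) = x^2$ for $|x| \le a$, and 0 otherwise. Find $a$.


Solution. Note that

$$\int_\Omega f_X(x)\, dx = \int_{-a}^{a} x^2\, dx = \frac{x^3}{3}\Big|_{-a}^{a} = \frac{2a^3}{3}.$$

Setting $\frac{2a^3}{3} = 1$ yields $a = \sqrt[3]{3/2}$.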
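
Both normalization problems can also be solved numerically, which is a useful sanity check when a closed form is not available. The sketch below is one possible approach under stated assumptions: `integrate` is a hypothetical midpoint-rule helper, the constant in Practice Exercise 4.2 is obtained by dividing 1 by the unnormalized area, and the limit in Practice Exercise 4.3 is found by bisection on a bracket [0, 2] chosen for this illustration.

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Practice Exercise 4.2: c * (area under 1 - x^2 on [-1, 1]) must equal 1.
c = 1.0 / integrate(lambda x: 1.0 - x**2, -1.0, 1.0)

# Practice Exercise 4.3: find a such that the area under x^2 on [-a, a] is 1.
def area(a):
    return integrate(lambda x: x**2, -a, a, n=20_000)

low, high = 0.0, 2.0          # bracket: area(0) = 0 < 1 < area(2)
for _ in range(40):           # bisection; the area increases with a
    mid = 0.5 * (low + high)
    if area(mid) < 1.0:
        low = mid
    else:
        high = mid
a = 0.5 * (low + high)

print(f"c = {c:.6f}, a = {a:.6f}")
```

The printed values should agree with 3/4 and the cube root of 3/2 up to the accuracy of the midpoint rule.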
4.1.3 Connecting with the PMF


The probability density function is more general than the probability mass function. To see 
this, consider a discrete random variable X with a PMF $p_X(x)$. Because $p_X$ is defined on
a countable set $\Omega$, we can write it as a train of delta functions and define a corresponding
PDF:

$$f_X(x) = \sum_{x_k \in \Omega} p_X(x_k)\, \delta(x - x_k).$$


Example 4.5. If X is a Bernoulli random variable with PMF $p_X(1) = p$ and $p_X(0) = 1 - p$,
then the corresponding PDF can be written as

$$f_X(x) = p\, \delta(x - 1) + (1 - p)\, \delta(x - 0).$$


Example 4.6. If X is a binomial random variable with PMF $p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}$,
then the corresponding PDF can be written as

$$f_X(x) = \sum_{k=0}^{n} p_X(k)\, \delta(x - k) = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k}\, \delta(x - k).$$


Strictly speaking, delta functions are not really functions. They are defined through
integrations. They satisfy the properties that $\delta(x - x_k) = \infty$ if $x = x_k$, $\delta(x - x_k) = 0$ if
$x \ne x_k$, and

$$\int_{x_k - \epsilon}^{x_k + \epsilon} \delta(x - x_k)\, dx = 1,$$

for any $\epsilon > 0$. Suppose we ignore the fact that delta functions are not functions and merely
treat them as ordinary functions with some interesting properties. In this case, we can
imagine that for every probability mass $p_X(x_k)$, there exists an interval $[a, b]$ such that
there is one and only one state $x_k$ that lies in $[a, b]$, as shown in Figure 4.4.


Figure 4.4: We can view a PMF as a train of impulses. When computing the probability $X = x_k$, we
integrate the PMF over the interval $[a, b]$.

If we want to calculate the probability of obtaining $X = x_k$, we can show that

$$P[X = x_k] \overset{(a)}{=} P[a \le X \le b] = \int_a^b f_X(x)\, dx \overset{(b)}{=} \int_a^b p_X(x_k)\, \delta(x - x_k)\, dx \overset{(c)}{=} p_X(x_k) \underbrace{\int_a^b \delta(x - x_k)\, dx}_{=1} = p_X(x_k).$$

Here, step (a) holds because within $[a, b]$, there is no other event besides $X = x_k$. Step (b)
is just the definition of our $f_X(x)$ (inside the interval $[a, b]$). Step (c) shows that the delta
function integrates to 1, thus leaving the probability mass $p_X(x_k)$ as the final result. Let us
look at an example and then comment on this intuition. 


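Example 4.7. Let X be a discrete random variable with PMF

$$p_X(k) = \frac{1}{2^k}, \quad k = 1, 2, \ldots.$$

The continuous representation of the PMF can be written as

$$f_X(x) = \sum_{k=1}^{\infty} p_X(k)\, \delta(x - k) = \sum_{k=1}^{\infty} \left(\frac{1}{2^k}\right) \delta(x - k).$$

Suppose we want to compute the probability $P[1 \le X \le 2]$. This can be computed as

$$P[1 \le X \le 2] = \int_1^2 f_X(x)\, dx = \int_1^2 \sum_{k=1}^{\infty} \left(\frac{1}{2^k}\right) \delta(x - k)\, dx = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}.$$

However, if we want to compute the probability $P[1 < X \le 2]$, then the integration
limit will not include the number 1 and so the delta function will remain 0. Thus,

$$P[1 < X \le 2] = \int_{1^+}^{2} f_X(x)\, dx = \frac{1}{4}.$$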

Closing remark. To summarize, we see that a PMF can be “regarded” as a PDF. We are 
careful to put a quotation around “regarded” because PMF and PDF are defined for different 
events. A PMF uses a discrete measure (i.e., a counter) for countable events, whereas a PDF 
uses a continuous measure (i.e., integration) for continuous events. The way we link the two is 
by using the delta functions. Using the delta functions is valid, but the argument we provide 
here is intuitive rather than rigorous. It is not rigorous because the integration we use is still 
the Riemann-Stieltjes integration, which does not handle delta functions. Therefore, while 
you can treat a discrete PDF as a train of delta functions, it is important to remember the 
limitations of the integrations we use. 
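
In code, the "train of impulses" view amounts to storing the impulse locations with their masses and summing the masses that fall inside the integration interval, while keeping track of whether each endpoint is included. The sketch below is a minimal illustration; the function name `interval_probability` and the three-point PMF are hypothetical.

```python
def interval_probability(pmf, a, b, include_a=True, include_b=True):
    """Sum the probability masses of the impulses located between a and b.

    pmf maps each impulse location x_k to its mass p_X(x_k). The flags say
    whether an impulse sitting exactly on an endpoint is counted.
    """
    total = 0.0
    for xk, mass in pmf.items():
        above = xk >= a if include_a else xk > a
        below = xk <= b if include_b else xk < b
        if above and below:
            total += mass
    return total

pmf = {1: 0.5, 2: 0.25, 3: 0.25}        # hypothetical PMF, masses sum to 1

closed = interval_probability(pmf, 1, 2)                       # P[1 <= X <= 2]
half_open = interval_probability(pmf, 1, 2, include_a=False)   # P[1 < X <= 2]
print(closed, half_open)   # 0.75 0.25
```

Unlike the continuous case, open and closed intervals give different answers here because an endpoint can carry a non-zero mass.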


4.2 Expectation, Moment, and Variance 


4.2.1 Definition and properties 


As with discrete random variables, we can define expectation for continuous random vari- 
ables. The definition is analogous: Just replace the summation with integration. 


Definition 4.3. The expectation of a continuous random variable X is

$$E[X] = \int_\Omega x\, f_X(x)\, dx.$$


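Example 4.8. (Uniform random variable) Let X be a continuous random variable
with PDF $f_X(x) = \frac{1}{b-a}$ for $a \le x \le b$, and 0 otherwise. The expectation is

$$E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\, dx = \int_a^b \frac{x}{b-a}\, dx = \frac{1}{b-a} \cdot \frac{x^2}{2}\Big|_a^b = \frac{a+b}{2}.$$


Example 4.9. (Exponential random variable) Let X be a continuous random variable
with PDF $f_X(x) = \lambda e^{-\lambda x}$, for $x \ge 0$. The expectation is

$$E[X] = \int_0^{\infty} x\, \lambda e^{-\lambda x}\, dx = \Big[-x e^{-\lambda x}\Big]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\, dx = \frac{1}{\lambda},$$

where the second equality is due to integration by parts.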
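
The two expectations can be cross-checked by integrating x times the density numerically. The sketch below assumes specific parameter values (a = 1, b = 4, λ = 2) chosen only for illustration, uses a hypothetical midpoint-rule helper `integrate`, and truncates the exponential integral at 50/λ, where the remaining tail is negligible.

```python
import math

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Uniform on [a, b]: the expectation should be close to (a + b) / 2.
a, b = 1.0, 4.0
mean_uniform = integrate(lambda x: x / (b - a), a, b)

# Exponential with rate lam: the expectation should be close to 1 / lam.
lam = 2.0
upper = 50.0 / lam   # truncation point (assumption); the tail beyond it is negligible
mean_exponential = integrate(lambda x: x * lam * math.exp(-lam * x), 0.0, upper)

print(mean_uniform, (a + b) / 2)
print(mean_exponential, 1 / lam)
```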


If a function g is applied to the random variable X, the expectation can be found using 
the following theorem. 


Theorem 4.1. Let $g : \Omega \to \mathbb{R}$ be a function and X be a continuous random variable.
Then

$$E[g(X)] = \int_\Omega g(x)\, f_X(x)\, dx. \tag{4.4}$$


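Example 4.10. (Uniform random variable) Let X be a continuous random variable
with $f_X(x) = \frac{1}{b-a}$ for $a \le x \le b$, and 0 otherwise. If $g(\cdot) = (\cdot)^2$, then

$$E[X^2] = \int_\Omega x^2 f_X(x)\, dx = \frac{1}{b-a} \int_a^b x^2\, dx = \frac{b^3 - a^3}{3(b-a)} = \frac{a^2 + ab + b^2}{3}.$$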


Practice Exercise 4.4. Let $\Theta$ be a continuous random variable with PDF $f_\Theta(\theta) = \frac{1}{2\pi}$
for $0 \le \theta \le 2\pi$ and is 0 otherwise. Let $Y = \cos(\omega t + \Theta)$. Find $E[Y]$.


Solution. Referring to Equation (4.4), the function $g$ is

$$g(\theta) = \cos(\omega t + \theta).$$

Therefore, the expectation $E[Y]$ is

$$E[Y] = \int_0^{2\pi} \cos(\omega t + \theta)\, f_\Theta(\theta)\, d\theta = \frac{1}{2\pi} \int_0^{2\pi} \cos(\omega t + \theta)\, d\theta = 0,$$

where the last equality holds because the integral of a sinusoid over one period is 0.


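Practice Exercise 4.5. Let $A \subseteq \Omega$. Let $I_A(X)$ be an indicator function such that

$$I_A(X) = \begin{cases} 1, & \text{if } X \in A, \\ 0, & \text{if } X \notin A. \end{cases}$$

Find $E[I_A(X)]$.


Solution. The expectation is

$$E[I_A(X)] = \int_\Omega I_A(x)\, f_X(x)\, dx = \int_{x \in A} f_X(x)\, dx = P[X \in A].$$


So the probability of $\{X \in A\}$ can be equivalently represented in terms of expectation.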
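
This identity is what makes simulation-based probability estimates work: averaging the indicator over many samples approximates its expectation, and hence the probability. The sketch below is an illustration under assumptions: X is taken to be exponential with rate 1, the event is A = [1, 2], and the exact value e^{-1} − e^{-2} is obtained by integrating the exponential density over A.

```python
import math
import random

rng = random.Random(1)      # fixed seed for reproducibility
lam = 1.0                   # assumed rate of the exponential random variable
a, b = 1.0, 2.0             # assumed event A = [a, b]
n = 200_000

samples = [rng.expovariate(lam) for _ in range(n)]

# Sample average of the indicator I_A(X) estimates E[I_A(X)] = P[X in A].
indicator_mean = sum(1 if a <= x <= b else 0 for x in samples) / n

# Exact probability from integrating lam * exp(-lam * x) over [a, b].
exact = math.exp(-lam * a) - math.exp(-lam * b)

print(f"estimate = {indicator_mean:.4f}, exact = {exact:.4f}")
```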


Practice Exercise 4.6. Is it true that E 


Solution. No. This is because 


All the properties of expectation we learned in the discrete case can be translated to
the continuous case. Specifically, we have that

• $E[aX] = aE[X]$: A scalar multiple of a random variable will scale the expectation.

• $E[X + a] = E[X] + a$: Constant addition of a random variable will offset the expectation.

• $E[aX + b] = aE[X] + b$: Affine transformation of a random variable will translate to
the expectation.


Practice Exercise 4.7. Prove the above three statements.


Solution. The third statement follows by combining the first two statements, so we just
need to show the first two:

$$E[aX] = \int_\Omega a x\, f_X(x)\, dx = a \int_\Omega x\, f_X(x)\, dx = aE[X],$$

$$E[X + a] = \int_\Omega (x + a)\, f_X(x)\, dx = \int_\Omega x\, f_X(x)\, dx + a = E[X] + a.$$


4.2.2 Existence of expectation 


As we discussed in the discrete case, not all random variables have an expectation. 


Definition 4.4. A random variable X has an expectation if it is absolutely integrable,
i.e.,

$$E[|X|] = \int_\Omega |x|\, f_X(x)\, dx < \infty. \tag{4.5}$$


Being absolutely integrable implies that the expectation is finite, because $E[|X|]$ is an upper
bound of $|E[X]|$.


Theorem 4.2. For any random variable X,

$$|E[X]| \le E[|X|].$$


Proof. Note that $f_X(x) \ge 0$. Therefore,

$$-|x|\, f_X(x) \le x\, f_X(x) \le |x|\, f_X(x), \quad \forall x \in \Omega.$$

Thus, integrating all three terms yields

$$-\int_\Omega |x|\, f_X(x)\, dx \le \int_\Omega x\, f_X(x)\, dx \le \int_\Omega |x|\, f_X(x)\, dx,$$

which is equivalent to $-E[|X|] \le E[X] \le E[|X|]$.


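Example 4.11. Here is a random variable whose expectation is undefined. Let X be
a random variable with PDF

$$f_X(x) = \frac{1}{\pi(1 + x^2)}, \quad x \in \mathbb{R}.$$

This random variable is called the Cauchy random variable. We can show that

$$E[X] = \int_{-\infty}^{\infty} x \cdot \frac{1}{\pi(1 + x^2)}\, dx = \frac{1}{\pi} \int_0^{\infty} \frac{x}{1 + x^2}\, dx + \frac{1}{\pi} \int_{-\infty}^{0} \frac{x}{1 + x^2}\, dx.$$

The first integral gives

$$\int_0^{\infty} \frac{x}{1 + x^2}\, dx = \frac{1}{2} \log(1 + x^2)\Big|_0^{\infty} = \infty,$$

and the second integral gives $-\infty$. Since neither integral is finite, the expectation is
undefined. We can also check the absolute integrability criterion:

$$E[|X|] = \int_{-\infty}^{\infty} |x| \cdot \frac{1}{\pi(1 + x^2)}\, dx \overset{(a)}{=} 2 \int_0^{\infty} \frac{x}{\pi(1 + x^2)}\, dx \ge 2 \int_1^{\infty} \frac{x}{\pi(1 + x^2)}\, dx \overset{(b)}{\ge} 2 \int_1^{\infty} \frac{x}{\pi(x^2 + x^2)}\, dx = \frac{1}{\pi} \log(x)\Big|_1^{\infty} = \infty,$$

where in (a) we use the fact that the function being integrated is even, and in (b) we
lower-bound $\frac{1}{1 + x^2} \ge \frac{1}{x^2 + x^2}$ if $x > 1$.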

4.2.3 Moment and variance 


The moment and variance of a continuous random variable can be defined analogously to 
the moment and variance of a discrete random variable, replacing the summations with 
integrations. 


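Definition 4.5. The kth moment of a continuous random variable X is

$$\mathbb{E}[X^k] = \int_{\Omega} x^k f_X(x)\,dx.$$

Definition 4.6. The variance of a continuous random variable X is

$$\mathrm{Var}[X] = \mathbb{E}[(X - \mu)^2] = \int_{\Omega} (x - \mu)^2 f_X(x)\,dx,$$

where $\mu \overset{\text{def}}{=} \mathbb{E}[X]$.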


It is not difficult to show that the variance can also be expressed as 


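$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mu^2,$$

because

$$\begin{aligned}
\mathrm{Var}[X] &= \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2 - 2X\mu + \mu^2] \\
&= \mathbb{E}[X^2] - 2\mathbb{E}[X]\mu + \mu^2 = \mathbb{E}[X^2] - \mu^2.
\end{aligned}$$

Practice Exercise 4.8. (Uniform random variable) Let X be a continuous random variable with PDF $f_X(x) = \frac{1}{b - a}$ for $a \le x \le b$, and 0 otherwise. Find Var[X].

Solution. We have shown that $\mathbb{E}[X] = \frac{a + b}{2}$ and $\mathbb{E}[X^2] = \frac{a^2 + ab + b^2}{3}$. Therefore, the variance is

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \frac{a^2 + ab + b^2}{3} - \left(\frac{a + b}{2}\right)^2 = \frac{(b - a)^2}{12}.$$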


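Practice Exercise 4.9. (Exponential random variable) Let X be a continuous random variable with PDF $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, and 0 otherwise. Find Var[X].

Solution. We have shown that $\mathbb{E}[X] = \frac{1}{\lambda}$. The second moment is

$$\mathbb{E}[X^2] = \int_{0}^{\infty} x^2 \lambda e^{-\lambda x}\,dx = -x^2 e^{-\lambda x}\Big|_{0}^{\infty} + \int_{0}^{\infty} 2x e^{-\lambda x}\,dx = \frac{2}{\lambda}\mathbb{E}[X] = \frac{2}{\lambda^2}.$$

Therefore,

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.$$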
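As a sanity check on the moment and variance formulas for the exponential random variable, the following sketch evaluates $\mathbb{E}[X]$, $\mathbb{E}[X^2]$ and $\mathrm{Var}[X]$ by numerical integration. The rate `lam = 2.0`, the truncation point `upper`, and the helper `integrate` are illustrative assumptions, not part of the text.

```python
import math

def integrate(g, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))

lam = 2.0             # rate parameter (an arbitrary choice)
upper = 50.0 / lam    # the tail mass beyond this point is negligible

def pdf(x):
    """Exponential PDF for x >= 0."""
    return lam * math.exp(-lam * x)

mean = integrate(lambda x: x * pdf(x), 0.0, upper)
second = integrate(lambda x: x ** 2 * pdf(x), 0.0, upper)
var = second - mean ** 2
print(mean, second, var)   # close to 1/lam, 2/lam**2 and 1/lam**2
```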


4.3 Cumulative Distribution Function


When we discussed discrete random variables, we introduced the concept of cumulative distribution functions (CDFs). One of the motivations was that if we view a PMF as a train of delta functions, they are technically not well-defined functions. However, it turns out that the CDF is always a well-defined function. In this section, we will complete the story by first discussing the CDF for continuous random variables. Then, we will come back and show you how the CDF can be derived for discrete random variables.

4.3.1 CDF for continuous random variables

Definition 4.7. Let X be a continuous random variable with a sample space $\Omega = \mathbb{R}$. The cumulative distribution function (CDF) of X is

$$F_X(x) \overset{\text{def}}{=} \mathbb{P}[X \le x] = \int_{-\infty}^{x} f_X(x')\,dx'. \tag{4.9}$$

The interpretation of the CDF can be seen from Figure 4.5. Given a PDF $f_X$, the CDF $F_X$ evaluated at x is the integration of $f_X$ from $-\infty$ up to a point x. The integration of $f_X$ from $-\infty$ to x is nothing but the area under the curve of $f_X$. Since $f_X$ is non-negative, the larger value x we use to evaluate in $F_X(x)$, the more area under the curve we are looking at. In the extreme when $x = -\infty$, we can see that $F_X(-\infty) = 0$, and when $x = +\infty$ we have that $F_X(+\infty) = \int_{-\infty}^{\infty} f_X(x)\,dx = 1$.


Figure 4.5: [Left] The PDF $f_X(x)$. [Right] The CDF $F_X(x)$. A CDF is the integral of the PDF. Thus, the height of a stem in the CDF corresponds to the area under the curve of the PDF.


Practice Exercise 4.10. (Uniform random variable) Let X be a continuous random variable with PDF $f_X(x) = \frac{1}{b - a}$ for $a \le x \le b$, and is 0 otherwise. Find the CDF of X.

Solution. The CDF of X is given by

$$F_X(x) = \begin{cases} 0, & x \le a, \\ \int_{a}^{x} f_X(x')\,dx', & a < x \le b, \\ 1, & x > b, \end{cases} \;=\; \begin{cases} 0, & x \le a, \\ \frac{x - a}{b - a}, & a < x \le b, \\ 1, & x > b. \end{cases}$$

As you can see from this practice exercise, we explicitly break the CDF into three segments. The first segment gives $F_X(x) = 0$ because for any $x \le a$, there is nothing to integrate, since $f_X(x) = 0$ for any $x < a$. Similarly, for the last segment, $F_X(x) = 1$ for all $x > b$ because once x goes beyond b, the integration will cover all the non-zeros of $f_X$. Figure 4.6 illustrates the PDF and CDF for this example.

Figure 4.6: Example: $f_X(x) = 1/(b - a)$ for $a \le x \le b$. The CDF has three segments.

In MATLAB, we can generate the PDF and CDF using the commands pdf and cdf respectively. For the particular example shown in Figure 4.6, the following code can be used. A similar set of commands can be implemented in Python.

% MATLAB code to generate the PDF and CDF
unif = makedist('Uniform','lower',-3,'upper',4);
x = linspace(-5, 10, 1500)';
f = pdf(unif, x);
F = cdf(unif, x);
figure(1); plot(x, f, 'LineWidth', 6);
figure(2); plot(x, F, 'LineWidth', 6);

# Python code to generate the PDF and CDF
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.linspace(-5,10,1500)
# SciPy describes Uniform(a,b) by loc = a and scale = b - a
f = stats.uniform.pdf(x,loc=-3,scale=7)
F = stats.uniform.cdf(x,loc=-3,scale=7)
plt.plot(x,f); plt.show()
plt.plot(x,F); plt.show()


Practice Exercise 4.11. (Exponential random variable) Let X be a continuous random variable with PDF $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, and 0 otherwise. Find the CDF of X.

Solution. Clearly, for $x < 0$, we have $F_X(x) = 0$. For $x \ge 0$, we can show that

$$F_X(x) = \int_{-\infty}^{x} f_X(x')\,dx' = \int_{0}^{x} \lambda e^{-\lambda x'}\,dx' = 1 - e^{-\lambda x}.$$

Therefore, the complete CDF is (see Figure 4.7 for illustration):

$$F_X(x) = \begin{cases} 0, & x < 0, \\ 1 - e^{-\lambda x}, & x \ge 0. \end{cases}$$

Figure 4.7: Example: $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. The CDF has two segments.


The MATLAB code and Python code to generate this figure are shown below. 


% MATLAB code to generate the PDF and CDF
pd = makedist('Exponential','mu',2);   % mu is the mean 1/lambda
x = linspace(-5, 10, 1500)';
f = pdf(pd, x);
F = cdf(pd, x);
figure(1); plot(x, f, 'LineWidth', 6);
figure(2); plot(x, F, 'LineWidth', 6);

# Python code to generate the PDF and CDF
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.linspace(-5,10,1500)
# scale is the mean 1/lambda; a bare second argument would be loc (a shift)
f = stats.expon.pdf(x,scale=2)
F = stats.expon.cdf(x,scale=2)
plt.plot(x,f); plt.show()
plt.plot(x,F); plt.show()


4.3.2 Properties of CDF 


Let us now describe the properties of a CDF. If we compare these with those for the discrete 
cases, we see that the continuous cases simply replace the summations by integrations. 
Therefore, we should expect to inherit most of the properties from the discrete cases. 


Proposition 4.1. Let X be a random variable (either continuous or discrete), then the CDF of X has the following properties:

(i) The CDF is nondecreasing.

(ii) The maximum of the CDF is when $x = \infty$: $F_X(+\infty) = 1$.

(iii) The minimum of the CDF is when $x = -\infty$: $F_X(-\infty) = 0$.

Proof. For (i), we notice that $F_X(x) = \int_{-\infty}^{x} f_X(x')\,dx'$. Therefore, if $s \le t$ then

$$F_X(s) = \int_{-\infty}^{s} f_X(x')\,dx' \le \int_{-\infty}^{t} f_X(x')\,dx' = F_X(t).$$

Thus it shows that $F_X$ is nondecreasing. (It does not need to be increasing because a CDF can have a steady state.) For (ii) and (iii), we can show that

$$F_X(+\infty) = \int_{-\infty}^{+\infty} f_X(x')\,dx' = 1, \quad \text{and} \quad F_X(-\infty) = \int_{-\infty}^{-\infty} f_X(x')\,dx' = 0.$$


Example 4.12. We can show that the CDF we derived for the uniform random variable satisfies these three properties. To see this, we note that

$$F_X(x) = \frac{x - a}{b - a}, \quad a \le x \le b.$$

The derivative of this function is $F_X'(x) = \frac{1}{b - a} > 0$ for $a \le x \le b$. Also, note that $F_X'(x) = 0$ for $x < a$ and $x > b$, so $F_X$ is nondecreasing. The other two properties follow because if $x = b$, then $F_X(b) = 1$, and if $x = a$ then $F_X(a) = 0$. Together with the nondecreasing property, we show (ii) and (iii).


Proposition 4.2. Let X be a continuous random variable. If the CDF $F_X$ is continuous at any $a \le x \le b$, then

$$\mathbb{P}[a \le X \le b] = F_X(b) - F_X(a). \tag{4.10}$$

Proof. The proof follows from the definition of the CDF, which states that

$$F_X(b) - F_X(a) = \int_{-\infty}^{b} f_X(x')\,dx' - \int_{-\infty}^{a} f_X(x')\,dx' = \int_{a}^{b} f_X(x')\,dx' = \mathbb{P}[a \le X \le b].$$

This result provides a very handy tool for calculating the probability of an event $a \le X \le b$ using the CDF. It says that $\mathbb{P}[a \le X \le b]$ is the difference between $F_X(b)$ and $F_X(a)$. So, if we are given $F_X$, calculating the probability of $a \le X \le b$ just involves evaluating the CDF at a and b. The result also shows that for a continuous random variable X, $\mathbb{P}[X = x_0] = F_X(x_0) - F_X(x_0) = 0$. This is consistent with our arguments from the measure's point of view.


Example 4.13. (Exponential random variable) We showed that the exponential random variable X with a PDF $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$ (and $f_X(x) = 0$ for $x < 0$) has a CDF given by $F_X(x) = 1 - e^{-\lambda x}$ for $x \ge 0$. Suppose we want to calculate the probability $\mathbb{P}[1 \le X \le 3]$. Then the PDF approach gives us

$$\mathbb{P}[1 \le X \le 3] = \int_{1}^{3} f_X(x)\,dx = \int_{1}^{3} \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\Big|_{1}^{3} = e^{-\lambda} - e^{-3\lambda}.$$

If we take the CDF approach, we can show that

$$\mathbb{P}[1 \le X \le 3] = F_X(3) - F_X(1) = (1 - e^{-3\lambda}) - (1 - e^{-\lambda}) = e^{-\lambda} - e^{-3\lambda},$$

which yields the same as the PDF approach.
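The agreement between the PDF approach and the CDF approach can be checked numerically. The sketch below assumes a rate `lam = 0.5` purely for illustration; `prob_by_pdf` is a hypothetical helper that integrates the exponential PDF with a midpoint rule.

```python
import math

lam = 0.5  # rate parameter (an arbitrary choice for illustration)

def pdf(x):
    """Exponential PDF."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def cdf(x):
    """Exponential CDF."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def prob_by_pdf(a, b, n=100_000):
    """Midpoint-rule approximation of the integral of the PDF over [a, b]."""
    h = (b - a) / n
    return h * sum(pdf(a + (i + 0.5) * h) for i in range(n))

p_pdf = prob_by_pdf(1.0, 3.0)     # PDF approach: integrate the density
p_cdf = cdf(3.0) - cdf(1.0)       # CDF approach: two evaluations
print(p_pdf, p_cdf)               # both approximate exp(-lam) - exp(-3*lam)
```

The CDF approach needs only two function evaluations, whereas the PDF approach needs an integral.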


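Example 4.14. Let X be a random variable with PDF $f_X(x) = 2x$ for $0 \le x \le 1$, and is 0 otherwise. We can show that the CDF is

$$F_X(x) = \int_{0}^{x} f_X(t)\,dt = \int_{0}^{x} 2t\,dt = t^2\Big|_{0}^{x} = x^2, \quad 0 \le x \le 1.$$

Therefore, to compute the probability $\mathbb{P}[1/3 \le X \le 1/2]$, we have

$$\mathbb{P}\left[\tfrac{1}{3} \le X \le \tfrac{1}{2}\right] = F_X\left(\tfrac{1}{2}\right) - F_X\left(\tfrac{1}{3}\right) = \left(\tfrac{1}{2}\right)^2 - \left(\tfrac{1}{3}\right)^2 = \frac{5}{36}.$$

A CDF can be used for both continuous and discrete random variables. However, before we can do that, we need a tool to handle the discontinuities. The following definition is a summary of the three types of continuity.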


Definition 4.8. A function $F_X(x)$ is said to be

• Left-continuous at $x = b$ if $F_X(b) = F_X(b^-) \overset{\text{def}}{=} \lim_{h \to 0} F_X(b - h)$;

• Right-continuous at $x = b$ if $F_X(b) = F_X(b^+) \overset{\text{def}}{=} \lim_{h \to 0} F_X(b + h)$;

• Continuous at $x = b$ if it is both right-continuous and left-continuous at $x = b$. In this case, we have

$$\lim_{h \to 0} F_X(b - h) = \lim_{h \to 0} F_X(b + h) = F_X(b).$$

In this definition, the step size $h > 0$ is shrinking to zero. The point $b - h$ stays at the left of b, and $b + h$ stays at the right of b. Thus, if we set the limit $h \to 0$, $b - h$ will approach a point $b^-$ whereas $b + h$ will approach a point $b^+$. If it happens that $F_X(b^-) = F_X(b)$ then we say that $F_X$ is left-continuous at b. If $F_X(b^+) = F_X(b)$ then we say that $F_X$ is right-continuous at b. These are summarized in Figure 4.8.

Figure 4.8: The definition of left- and right-continuous at a point b. [Left] Right continuous. [Right] Left continuous.

Whenever $F_X$ has a discontinuous point, it can be left-continuous, right-continuous, or neither. ("Neither" happens if $F_X(b)$ takes a value other than $F_X(b^+)$ or $F_X(b^-)$. You can always create a nasty function that satisfies this condition.) For continuous functions, it is necessary that $F_X(b^-) = F_X(b^+)$. If this happens, there is no gap between the two points.


Theorem 4.3. For any random variable X (discrete or continuous), $F_X(x)$ is always right-continuous. That is,

$$F_X(b) = F_X(b^+) \overset{\text{def}}{=} \lim_{h \to 0} F_X(b + h). \tag{4.11}$$

Right-continuous means that if $F_X(x)$ is piecewise, it must have a solid left end and an empty right end. Figure 4.9 shows an example of a valid CDF and an invalid CDF.

Figure 4.9: A CDF must be right-continuous.


The reason why $F_X$ is always right-continuous is that the inequality $X \le x$ has a closed right-hand limit. Imagine the following situation: A discrete random variable X has four states: 1, 2, 3, 4. Then,

$$\lim_{h \to 0} F_X(3 + h) = \lim_{h \to 0} \sum_{k=1}^{3+h} p_X(k) = p_X(1) + p_X(2) + p_X(3) = F_X(3).$$

Similarly, if you have a continuous random variable X with a PDF $f_X$, then

$$\lim_{h \to 0} F_X(b + h) = \lim_{h \to 0} \int_{-\infty}^{b+h} f_X(t)\,dt = \int_{-\infty}^{b} f_X(t)\,dt = F_X(b).$$

In other words, the "$\le$" ensures that the rightmost state is included. If we had defined the CDF using $<$, it would have been left-continuous, but this would be inconvenient because the $<$ requires us to deal with limits whenever we evaluate $X \le x$.


Theorem 4.4. For any random variable X (discrete or continuous), $\mathbb{P}[X = b]$ is

$$\mathbb{P}[X = b] = \begin{cases} F_X(b) - F_X(b^-), & \text{if } F_X \text{ is discontinuous at } x = b, \\ 0, & \text{otherwise.} \end{cases} \tag{4.12}$$

This theorem states that when $F_X(x)$ is discontinuous at $x = b$, then $\mathbb{P}[X = b]$ is the difference between $F_X(b)$ and the limit from the left. In other words, the height of the gap determines the probability at the discontinuity. If $F_X(x)$ is continuous at $x = b$, then $F_X(b) = \lim_{h \to 0} F_X(b - h)$ and so $\mathbb{P}[X = b] = 0$.

Figure 4.10: Illustration of Equation (4.12). Since the CDF is discontinuous at a point $x = b$, the gap $F_X(b) - F_X(b^-)$ will define the probability $\mathbb{P}[X = b]$.


Example 4.15. Consider a random variable X with a PDF

$$f_X(x) = \begin{cases} x, & 0 \le x \le 1, \\ \frac{1}{2}, & x = 3, \\ 0, & \text{otherwise.} \end{cases}$$

The CDF $F_X(x)$ will consist of a few segments. The first segment is $0 \le x < 1$. We can show that

$$F_X(x) = \int_{0}^{x} f_X(t)\,dt = \int_{0}^{x} t\,dt = \frac{t^2}{2}\Big|_{0}^{x} = \frac{x^2}{2}, \quad 0 \le x < 1.$$

The second segment is when $1 \le x < 3$. Since there is no new $f_X$ to integrate, the CDF stays at $F_X(x) = F_X(1) = \frac{1}{2}$ for $1 \le x < 3$. The third segment is $x > 3$. Because this range has covered the entire sample space, we have $F_X(x) = 1$ for $x > 3$. How about $x = 3$? We can show that

$$F_X(3) = F_X(3^+) = 1.$$

Therefore, to summarize, the CDF is

$$F_X(x) = \begin{cases} 0, & x < 0, \\ \frac{x^2}{2}, & 0 \le x < 1, \\ \frac{1}{2}, & 1 \le x < 3, \\ 1, & x \ge 3. \end{cases}$$

A graphical illustration is shown in Figure 4.11.


Figure 4.11: An example of converting a PDF to a CDF. 
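To make the role of the jump concrete, here is a small sketch of a piecewise CDF of this kind. It assumes the random variable has density $x$ on $[0, 1]$ and a point mass of $1/2$ at $x = 3$; the helper `point_mass` and its step size `h` are our own illustrative choices for approximating $F_X(b) - F_X(b^-)$.

```python
def cdf(x):
    """CDF for a density x on [0, 1] plus a point mass of 1/2 at x = 3."""
    if x < 0:
        return 0.0
    if x < 1:
        return x * x / 2.0
    if x < 3:
        return 0.5
    return 1.0   # x >= 3: the solid dot sits on the upper side of the jump

def point_mass(b, h=1e-9):
    """Approximate P[X = b] = F_X(b) - F_X(b^-) using a small left step h."""
    return cdf(b) - cdf(b - h)

print(point_mass(3.0))             # size of the jump at x = 3
print(point_mass(0.5))             # a continuous point: essentially zero
print(cdf(3.0) == cdf(3.0 + 1e-9)) # right-continuity at the jump
```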


4.3.3 Retrieving PDF from CDF 


Thus far, we have only seen how to obtain $F_X(x)$ from $f_X(x)$. In order to go in the reverse direction, we recall the fundamental theorem of calculus. This states that if a function f is continuous, then

$$f(x) = \frac{d}{dx}\int_{a}^{x} f(t)\,dt$$

for some constant a. Using this result for CDF and PDF, we have the following:

Theorem 4.5. The probability density function (PDF) is the derivative of the cumulative distribution function (CDF):

$$f_X(x) = \frac{dF_X(x)}{dx} = \frac{d}{dx}\int_{-\infty}^{x} f_X(x')\,dx', \tag{4.13}$$

provided $F_X$ is differentiable at x. If $F_X$ is not differentiable at $x = x_0$, then,

$$f_X(x_0) = \mathbb{P}[X = x_0]\,\delta(x - x_0).$$


Example 4.16. Consider a CDF

$$F_X(x) = \begin{cases} 0, & x < 0, \\ 1 - \frac{1}{4}e^{-2x}, & x \ge 0. \end{cases}$$

We want to find the PDF $f_X(x)$. To do so, we first show that $F_X(0) = \frac{3}{4}$. This corresponds to a discontinuity at $x = 0$, as shown in Figure 4.12.

Figure 4.12: An example of converting a PDF to a CDF.

Because of the discontinuity, we need to consider three cases:

When $x < 0$, $F_X(x) = 0$, so $\frac{dF_X(x)}{dx} = 0$.

When $x > 0$, $F_X(x) = 1 - \frac{1}{4}e^{-2x}$, so

$$\frac{dF_X(x)}{dx} = \frac{1}{2}e^{-2x}.$$

When $x = 0$, the probability $\mathbb{P}[X = 0]$ is determined by the gap between the solid dot and the empty dot. This yields

$$\mathbb{P}[X = 0] = F_X(0) - \lim_{h \to 0} F_X(0 - h) = \frac{3}{4} - 0 = \frac{3}{4}.$$

Therefore, the overall PDF is

$$f_X(x) = \begin{cases} 0, & x < 0, \\ \frac{3}{4}\delta(x), & x = 0, \\ \frac{1}{2}e^{-2x}, & x > 0. \end{cases}$$

Figure 4.12 illustrates this example.
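The two ingredients of this example, a derivative where the CDF is smooth and a gap where it jumps, can be reproduced numerically. The sketch below assumes the CDF $F_X(x) = 1 - \frac{1}{4}e^{-2x}$ for $x \ge 0$ and 0 otherwise; the finite-difference helper `numerical_pdf` is a hypothetical name used only for illustration.

```python
import math

def cdf(x):
    """CDF that is 0 for x < 0 and 1 - (1/4) exp(-2x) for x >= 0."""
    return 0.0 if x < 0 else 1.0 - 0.25 * math.exp(-2.0 * x)

def numerical_pdf(x, h=1e-6):
    """Central finite difference of the CDF; valid where the CDF is differentiable."""
    return (cdf(x + h) - cdf(x - h)) / (2.0 * h)

x0 = 0.5
print(numerical_pdf(x0), 0.5 * math.exp(-2.0 * x0))  # derivative vs. closed form
print(cdf(0.0) - cdf(-1e-12))                         # gap at x = 0, i.e. P[X = 0]
```

Applying the finite difference at $x = 0$ itself would return a very large number that grows as `h` shrinks, which is the numerical footprint of the delta function.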


4.3.4 CDF: Unifying discrete and continuous random variables 


The CDF is always a well-defined function. It is integrable everywhere. If the underlying random variable is continuous, the CDF is also continuous. If the underlying random variable is discrete, the CDF is a staircase function. We have seen enough CDFs for continuous random variables. Let us (re)visit a few discrete random variables.

Example 4.17. (Geometric random variable) Consider a geometric random variable with PMF $p_X(k) = (1 - p)^{k-1} p$, for $k = 1, 2, \ldots$.

Figure 4.13: PMF and CDF of a geometric random variable.

We can show that the CDF is

$$F_X(k) = \sum_{\ell=1}^{k} p_X(\ell) = \sum_{\ell=1}^{k} (1 - p)^{\ell - 1} p = p \cdot \frac{1 - (1 - p)^k}{1 - (1 - p)} = 1 - (1 - p)^k.$$

For a sanity check, we can try to retrieve the PMF from the CDF:

$$\begin{aligned}
p_X(k) &= F_X(k) - F_X(k - 1) \\
&= \left(1 - (1 - p)^k\right) - \left(1 - (1 - p)^{k-1}\right) \\
&= (1 - p)^{k-1} p.
\end{aligned}$$

A graphical portrayal of this example is shown in Figure 4.13.


If we treat the PMFs as delta functions in the above example, then the continuous 
definition also applies. Since the CDF is a piecewise constant function, the derivative is 
exactly a delta function. For some problems, it is easier to start with CDF and then compute 
the PMF or PDF. Here is an example. 


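Example 4.18. Let $X_1$, $X_2$ and $X_3$ be three independent discrete random variables with sample space $\Omega = \{1, 2, \ldots, 10\}$. Define $X = \max\{X_1, X_2, X_3\}$. We want to find the PMF of X. To tackle this problem, we first observe that the PMF for $X_1$ is $p_{X_1}(k) = \frac{1}{10}$. Thus, the CDF of $X_1$ is

$$F_{X_1}(k) = \sum_{\ell=1}^{k} p_{X_1}(\ell) = \frac{k}{10}.$$

Then, we can show that the CDF of X is

$$\begin{aligned}
F_X(k) = \mathbb{P}[X \le k] &= \mathbb{P}[\max\{X_1, X_2, X_3\} \le k] \\
&\overset{(a)}{=} \mathbb{P}[X_1 \le k \cap X_2 \le k \cap X_3 \le k] \\
&\overset{(b)}{=} \mathbb{P}[X_1 \le k]\,\mathbb{P}[X_2 \le k]\,\mathbb{P}[X_3 \le k] \\
&= \left(\frac{k}{10}\right)^3,
\end{aligned}$$

where in (a) we use the fact that $\max\{X_1, X_2, X_3\} \le k$ if and only if all three elements are less than or equal to k, and in (b) we use independence. Consequently, the PMF of X is

$$p_X(k) = F_X(k) - F_X(k - 1) = \left(\frac{k}{10}\right)^3 - \left(\frac{k - 1}{10}\right)^3.$$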
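The CDF-first route for the maximum of three independent draws can be verified by brute force. The following sketch is an illustration with our own variable names: it builds the PMF from the differences $F_X(k) - F_X(k-1)$ using exact fractions, and compares it with a direct enumeration of all $10^3$ equally likely outcomes.

```python
from fractions import Fraction
from itertools import product

K = 10  # each X_i is uniform on {1, ..., K}

def cdf_max(k):
    """F_X(k) = (k / K)^3 for the maximum of three independent draws."""
    return Fraction(k, K) ** 3

# PMF from the CDF: p_X(k) = F_X(k) - F_X(k - 1)
pmf = {k: cdf_max(k) - cdf_max(k - 1) for k in range(1, K + 1)}

# Brute-force check: enumerate all K^3 equally likely outcomes.
counts = {k: 0 for k in range(1, K + 1)}
for outcome in product(range(1, K + 1), repeat=3):
    counts[max(outcome)] += 1
brute = {k: Fraction(c, K ** 3) for k, c in counts.items()}

print(pmf == brute)       # the two PMFs agree exactly
print(float(pmf[K]))      # probability that the maximum equals K
```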


What is a CDF?

• CDF is $F_X(x) = \mathbb{P}[X \le x]$. It is the cumulative sum of the PMF/PDF.

• CDF is either a staircase function, a smooth function, or a hybrid. Unlike a PDF, which is not defined for discrete random variables, the CDF is always well defined.

• CDF $\xrightarrow{\;\frac{d}{dx}\;}$ PDF.

• CDF $\xleftarrow{\;\int\;}$ PDF.

• Gap of jump in CDF = height of delta in PDF.


4.4 Median, Mode, and Mean 


There are three statistical quantities that we are frequently interested in: mean, mode, and 
median. We all know how to compute these from a dataset. For example, to compute the 
median of a dataset, we sort the data and pick the number that sits in the 50th percentile. 
However, the median computed in this way is the empirical median, i.e., it is a value 
computed from a particular dataset. If the data is generated from a random variable (with 
a given PDF), how do we compute the mean, median, and mode? 


4.4.1 Median 


Imagine you have a sequence of numbers as shown below.

n      1     2     3     4     5      6      7     8     9      ...   100
x_n    1.5   2.5   3.1   1.1   -0.4   -4.1   0.5   2.2   -3.4   ...   -1.4

How do we compute the median? We first sort the sequence (either in ascending order or descending order), and then pick the middle one. On a computer, we permute the samples

$$\{x_{1'}, x_{2'}, \ldots, x_{N'}\} = \text{sort}\{x_1, x_2, \ldots, x_N\}$$

such that $x_{1'} < x_{2'} < \ldots < x_{N'}$ is ordered. The median is the one positioned at the middle. There are, of course, built-in commands such as median in MATLAB and np.median in Python to perform the median operation.

Now, how do we compute the median if we are given a random variable X with a PDF $f_X(x)$? The answer is by integrating the PDF.

Definition 4.9. Let X be a continuous random variable with PDF $f_X$. The median of X is a point $c \in \mathbb{R}$ such that

$$\int_{-\infty}^{c} f_X(x)\,dx = \int_{c}^{\infty} f_X(x)\,dx. \tag{4.14}$$

Why is the median defined in this way? This is because $\int_{-\infty}^{c} f_X(x)\,dx$ is the area under the curve on the left of c, and $\int_{c}^{\infty} f_X(x)\,dx$ is the area under the curve on the right of c. The area under the curve tells us the percentage of numbers that are less than the cutoff. Therefore, if the left area equals the right area, then c must be the median.


How to find the median from the PDF

• Find a point c that separates the PDF into two equal areas.

Figure 4.14: [Left] The median is computed as the point such that the two areas under the curve are equal. [Right] The median is computed as the point such that $F_X$ hits 0.5.


The median can also be evaluated from the CDF as follows. 


Theorem 4.6. The median of a random variable X is the point c such that

$$F_X(c) = \frac{1}{2}.$$

Proof. Since $F_X(x) = \int_{-\infty}^{x} f_X(x')\,dx'$, we have

$$F_X(c) = \int_{-\infty}^{c} f_X(x)\,dx = \int_{c}^{\infty} f_X(x)\,dx = 1 - F_X(c).$$

Rearranging the terms shows that $F_X(c) = \frac{1}{2}$.

How to find median from CDF

• Find a point c such that $F_X(c) = 0.5$.


Example 4.19. (Uniform random variable) Let X be a continuous random variable with PDF $f_X(x) = \frac{1}{b - a}$ for $a \le x \le b$, and is 0 otherwise. We know that the CDF of X is $F_X(x) = \frac{x - a}{b - a}$ for $a \le x \le b$. Therefore, the median of X is the number $c \in \mathbb{R}$ such that $F_X(c) = \frac{1}{2}$. Substituting into the CDF yields $\frac{c - a}{b - a} = \frac{1}{2}$, which gives $c = \frac{a + b}{2}$.

Example 4.20. (Exponential random variable) Let X be a continuous random variable with PDF $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. We know that the CDF of X is $F_X(x) = 1 - e^{-\lambda x}$ for $x \ge 0$. The median of X is the point c such that $F_X(c) = \frac{1}{2}$. This gives $1 - e^{-\lambda c} = \frac{1}{2}$, which is $c = \frac{\log 2}{\lambda}$.
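When the equation $F_X(c) = \frac{1}{2}$ has no convenient closed form, the median can be found by a root search on the CDF. The sketch below is one possible way to do this, using plain bisection. The function `median_from_cdf` and the parameter values are illustrative assumptions; the two test cases are the uniform and exponential examples, for which the answers $\frac{a+b}{2}$ and $\frac{\log 2}{\lambda}$ are known.

```python
import math

def median_from_cdf(cdf, lo, hi, tol=1e-12):
    """Bisection search for c with cdf(c) = 0.5, assuming cdf(lo) < 0.5 < cdf(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < 0.5:
            lo = mid      # the median lies to the right of mid
        else:
            hi = mid      # the median lies at or to the left of mid
    return 0.5 * (lo + hi)

lam = 2.0            # exponential rate (arbitrary choice)
a, b = -3.0, 4.0     # uniform interval (arbitrary choice)

def exp_cdf(x):
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def unif_cdf(x):
    return min(max((x - a) / (b - a), 0.0), 1.0)

exp_median = median_from_cdf(exp_cdf, 0.0, 100.0)
unif_median = median_from_cdf(unif_cdf, a, b)
print(exp_median, math.log(2) / lam)
print(unif_median, (a + b) / 2)
```

Bisection works here because a CDF is nondecreasing, so the sign of $F_X(c) - \frac{1}{2}$ tells us on which side of c the median lies.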
4.4.2 Mode 
The mode is the peak of the PDF. We can see this from the definition below. 


Definition 4.10. Let X be a continuous random variable. The mode is the point c such that $f_X(x)$ attains the maximum:

$$c = \underset{x \in \Omega}{\text{argmax}}\; f_X(x) = \underset{x \in \Omega}{\text{argmax}}\; \frac{d}{dx}F_X(x). \tag{4.16}$$

The second equality holds because $f_X(x) = F_X'(x) = \frac{d}{dx}\int_{-\infty}^{x} f_X(t)\,dt$. A pictorial illustration of mode is given in Figure 4.15. Note that the mode of a random variable is not necessarily unique, e.g., a mixture of two identical Gaussians with different means has two modes.

Figure 4.15: [Left] The mode appears at the peak of the PDF. [Right] The mode appears at the steepest slope of the CDF.

How to find mode from PDF

• Find a point c such that $f_X(c)$ is maximized.

How to find mode from CDF

• Continuous: Find a point c such that $F_X(c)$ has the steepest slope.

• Discrete: Find a point c such that $F_X(c)$ has the biggest gap in a jump.


Example 4.21. Let X be a continuous random variable with PDF $f_X(x) = 6x(1 - x)$ for $0 \le x \le 1$. The mode of X happens at $\text{argmax}_x\, f_X(x)$. To find this maximum, we take the derivative of $f_X$. This gives

$$\frac{d}{dx} f_X(x) = \frac{d}{dx}\, 6x(1 - x) = 6(1 - 2x).$$

Setting this equal to zero yields $x = \frac{1}{2}$.

To ensure that this point is a maximum, we take the second-order derivative:

$$\frac{d^2}{dx^2} f_X(x) = \frac{d}{dx}\, 6(1 - 2x) = -12 < 0.$$

Therefore, we conclude that $x = \frac{1}{2}$ is a maximum point. Hence, the mode of X is $x = \frac{1}{2}$.
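When the derivative is awkward to solve by hand, a crude but serviceable alternative is to evaluate the PDF on a fine grid and pick the largest value. The sketch below does this for $f_X(x) = 6x(1 - x)$; the grid resolution `n` is an arbitrary illustrative choice.

```python
def pdf(x):
    """PDF f_X(x) = 6x(1 - x) on [0, 1], and 0 elsewhere."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0

n = 10_000
grid = [i / n for i in range(n + 1)]   # evenly spaced points on [0, 1]
mode = max(grid, key=pdf)              # grid point with the largest density
print(mode, pdf(mode))
```

The grid search only locates the mode up to the grid spacing, so it complements rather than replaces the calculus argument.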


4.4.3 Mean


We have defined the mean as the expectation of X. Here, we show how to compute the 
expectation from the CDF. To simplify the demonstration, let us first assume that X > 0. 


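Lemma 4.1. Let X > 0. Then $\mathbb{E}[X]$ can be computed from $F_X$ as

$$\mathbb{E}[X] = \int_{0}^{\infty} (1 - F_X(t))\,dt.$$

Proof. The trick is to change the integration order:

$$\begin{aligned}
\int_{0}^{\infty} (1 - F_X(t))\,dt &= \int_{0}^{\infty} \left[1 - \mathbb{P}[X \le t]\right] dt = \int_{0}^{\infty} \mathbb{P}[X > t]\,dt \\
&= \int_{0}^{\infty}\int_{t}^{\infty} f_X(x)\,dx\,dt \overset{(a)}{=} \int_{0}^{\infty}\int_{0}^{x} f_X(x)\,dt\,dx \\
&= \int_{0}^{\infty} x f_X(x)\,dx = \mathbb{E}[X].
\end{aligned}$$

Here, step (a) is due to the change of integration order. See Figure 4.16 for an illustration.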


[Figure 4.16: illustration of the change of integration order.]

We draw a picture to illustrate the above lemma. As shown in Figure 4.17, the mean of a positive random variable X > 0 is equivalent to the area above the CDF.

Figure 4.17: The mean of a positive random variable X > 0 can be calculated by integrating the CDF's complement.


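Lemma 4.2. Let X < 0. Then $\mathbb{E}[X]$ can be computed from $F_X$ as

$$\mathbb{E}[X] = -\int_{-\infty}^{0} F_X(t)\,dt.$$

Proof. The idea here is also to change the integration order.

$$\begin{aligned}
\int_{-\infty}^{0} F_X(t)\,dt &= \int_{-\infty}^{0} \mathbb{P}[X \le t]\,dt = \int_{-\infty}^{0}\int_{-\infty}^{t} f_X(x)\,dx\,dt \\
&= \int_{-\infty}^{0}\int_{x}^{0} f_X(x)\,dt\,dx = -\int_{-\infty}^{0} x f_X(x)\,dx = -\mathbb{E}[X].
\end{aligned}$$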


Theorem 4.7. The mean of a random variable X can be computed from the CDF as

$$\mathbb{E}[X] = \int_{0}^{\infty} (1 - F_X(t))\,dt - \int_{-\infty}^{0} F_X(t)\,dt. \tag{4.19}$$

Proof. For any random variable X, we can partition $X = X^+ - X^-$ where $X^+ = \max(X, 0)$ and $X^- = \max(-X, 0)$ are the positive and negative parts, respectively. Then, the above two lemmas will give us

$$\mathbb{E}[X] = \mathbb{E}[X^+ - X^-] = \mathbb{E}[X^+] - \mathbb{E}[X^-] = \int_{0}^{\infty} (1 - F_X(t))\,dt - \int_{-\infty}^{0} F_X(t)\,dt.$$


As illustrated in Figure 4.18, this equation is equivalent to computing the areas above 
and below the CDF and taking the difference. 


Figure 4.18: The mean of a random variable X can be calculated by computing the area in the CDF (mean = area 1 - area 2).


How to find the mean from the CDF

• A formula is given by Equation (4.20):

$$\mathbb{E}[X] = \int_{0}^{\infty} (1 - F_X(t))\,dt - \int_{-\infty}^{0} F_X(t)\,dt. \tag{4.20}$$

• This result is not commonly used, but the proof technique of switching the integration order is important.
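The "area above minus area below" formula can be checked on a distribution whose mean is known. The sketch below uses a uniform random variable on $[a, b]$ with $a < 0 < b$, so that both integrals contribute. The interval endpoints and the helper `integrate` are illustrative choices, not part of the text.

```python
a, b = -3.0, 4.0   # X ~ Uniform(a, b); an arbitrary choice with a < 0 < b

def cdf(t):
    """CDF of Uniform(a, b)."""
    return min(max((t - a) / (b - a), 0.0), 1.0)

def integrate(g, lo, hi, n=200_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))

# 1 - F_X(t) vanishes for t > b and F_X(t) vanishes for t < a,
# so the infinite ranges reduce to [0, b] and [a, 0].
area_above = integrate(lambda t: 1.0 - cdf(t), 0.0, b)   # area above the CDF, t >= 0
area_below = integrate(cdf, a, 0.0)                      # area below the CDF, t < 0
mean = area_above - area_below
print(area_above, area_below, mean)   # mean should match (a + b) / 2
```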


4.5 Uniform and Exponential Random Variables 


There are many useful continuous random variables. In this section, we discuss two of them: uniform random variables and exponential random variables. In the next section, we will discuss the Gaussian random variables. Similarly to the way we discussed discrete random variables, we take a generative / synthesis perspective when studying continuous random variables. We assume we have access to the PDF of the random variables so we can derive the theoretical mean and variance. The opposite direction, namely inferring the underlying model parameters from a dataset, will be discussed later.

4.5.1 Uniform random variables

Definition 4.11. Let X be a continuous uniform random variable. The PDF of X is

$$f_X(x) = \begin{cases} \frac{1}{b - a}, & a \le x \le b, \\ 0, & \text{otherwise,} \end{cases} \tag{4.21}$$

where $[a, b]$ is the interval on which X is defined. We write

$$X \sim \text{Uniform}(a, b)$$

to mean that X is drawn from a uniform distribution on an interval $[a, b]$.

Figure 4.19: The PDF and CDF of X ~ Uniform(0.2, 0.6). (a) PDF. (b) CDF.


The shape of the PDF of a uniform random variable is shown in Figure 4.19. In this figure, we assume that the random variable X ~ Uniform(0.2, 0.6) is taken from the sample space $\Omega = [0, 1]$. Note that the height of the uniform distribution is greater than 1, since

$$f_X(x) = \begin{cases} \frac{1}{0.6 - 0.2} = 2.5, & 0.2 \le x \le 0.6, \\ 0, & \text{otherwise.} \end{cases}$$

There is nothing wrong with this PDF, because $f_X(x)$ is the probability per unit length. If we integrate $f_X(x)$ over any sub-interval between 0.2 and 0.6, we can show that the probability is between 0 and 1.

The CDF of a uniform random variable can be determined by integrating $f_X(x)$:

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_{a}^{x} \frac{1}{b - a}\,dt = \frac{x - a}{b - a}, \quad a \le x \le b.$$

Therefore, the complete CDF is

$$F_X(x) = \begin{cases} 0, & x < a, \\ \frac{x - a}{b - a}, & a \le x \le b, \\ 1, & x > b. \end{cases}$$

The corresponding CDF for the PDF we showed in Figure 4.19(a) is shown in Figure 4.19(b). 
It can be seen that although the height of the PDF exceeds 1, the CDF grows linearly and 
saturates at 1. 


Remark. The uniform distribution can also be defined for discrete random variables. In this case, the probability mass function is given by

$$p_X(k) = \frac{1}{b - a + 1}, \quad k = a, a + 1, \ldots, b.$$

The presence of "1" in the denominator of the PMF is because k runs from a to b, including the two endpoints.

In MATLAB and Python, generating uniform random numbers can be done by calling 
commands unifrnd (MATLAB), and stats.uniform.rvs (Python). For discrete uniform 
random variables, in MATLAB the command is unidrnd, and in Python the command is 
stats.randint. 


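% MATLAB code to generate 1000 uniform random numbers
a = 0; b = 1;
X = unifrnd(a,b,[1000,1]);
hist(X);

# Python code to generate 1000 uniform random numbers
import matplotlib.pyplot as plt
import scipy.stats as stats
a = 0; b = 1
# SciPy describes Uniform(a,b) by loc = a and scale = b - a
X = stats.uniform.rvs(loc=a,scale=b-a,size=1000)
plt.hist(X); plt.show()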


To compute the empirical average and variance of the random numbers in MATLAB 
we can call the command mean and var. The corresponding command in Python is np.mean 
and np.var. We can also compute the median and mode, as shown below. 


% MATLAB code to compute empirical mean, var, median, mode
X = unifrnd(a,b,[1000,1]);
M = mean(X);
V = var(X);
Med = median(X);
Mod = mode(X);

# Python code to compute empirical mean, var, median, mode
import numpy as np
import scipy.stats as stats
a = 0; b = 1
X = stats.uniform.rvs(loc=a,scale=b-a,size=1000)
M = np.mean(X)
V = np.var(X)
Med = np.median(X)
Mod = stats.mode(X)


The mean and variance of a uniform random variable are given by the theorem below. 


Theorem 4.8. If X ~ Uniform(a, b), then

$$\mathbb{E}[X] = \frac{a + b}{2} \quad \text{and} \quad \mathrm{Var}[X] = \frac{(b - a)^2}{12}.$$


The result should be intuitive because it says that the mean is the midpoint of the 
PDF. 

When will we encounter a uniform random variable? Uniform random variables are one 
of the most elementary continuous random variables. Given a uniform random variable, we 
can construct any random variable by using an appropriate transformation. We will discuss 
this technique as part of our discussion about generating random numbers. 

In MATLAB, computing the mean and variance of a uniform random variable can be done using the command unifstat. The Python command is stats.uniform.stats.

% MATLAB code to compute mean and variance
a = 0; b = 1;
[M,V] = unifstat(a,b)

# Python code to compute mean and variance
import scipy.stats as stats
a = 0; b = 1
# SciPy describes Uniform(a,b) by loc = a and scale = b - a
M, V = stats.uniform.stats(loc=a,scale=b-a,moments='mv')


To evaluate the probability $\mathbb{P}[\ell \le X \le u]$ for a uniform random variable, we can call unifcdf in MATLAB and stats.uniform.cdf in Python.

% MATLAB code to compute the probability P(0.2 < X < 0.3)
a = 0; b = 1;
F = unifcdf(0.3,a,b) - unifcdf(0.2,a,b)

# Python code to compute the probability P(0.2 < X < 0.3)
import scipy.stats as stats
a = 0; b = 1
F = stats.uniform.cdf(0.3,loc=a,scale=b-a)-stats.uniform.cdf(0.2,loc=a,scale=b-a)

An alternative is to define an object rv = stats.uniform, and call the CDF attribute:

# Python code to compute the probability P(0.2 < X < 0.3)
import scipy.stats as stats
a = 0; b = 1
rv = stats.uniform(loc=a,scale=b-a)
F = rv.cdf(0.3)-rv.cdf(0.2)


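4.5.2 Exponential random variables

Definition 4.12. Let X be an exponential random variable. The PDF of X is

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0, \\ 0, & \text{otherwise,} \end{cases} \tag{4.23}$$

where $\lambda > 0$ is a parameter. We write

$$X \sim \text{Exponential}(\lambda)$$

to mean that X is drawn from an exponential distribution of parameter $\lambda$.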


In this definition, the parameter $\lambda$ of the exponential random variable determines the rate of decay. A large $\lambda$ implies a faster decay. The PDF of an exponential random variable is illustrated in Figure 4.20. We show two values of $\lambda$. Note that the initial value $f_X(0)$ is

$$f_X(0) = \lambda e^{-\lambda \cdot 0} = \lambda.$$

Therefore, as long as $\lambda > 1$, $f_X(0)$ will exceed 1.

The CDF of an exponential random variable can be determined by

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_{0}^{x} \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}, \quad x \ge 0.$$

Therefore, if we consider the entire real line, the CDF is

$$F_X(x) = \begin{cases} 0, & x < 0, \\ 1 - e^{-\lambda x}, & x \ge 0. \end{cases}$$

The corresponding CDFs for the PDFs shown in Figure 4.20(a) are shown in Figure 4.20(b). For larger $\lambda$, the PDF $f_X(x)$ decays faster but the CDF $F_X(x)$ increases faster.

Figure 4.20: (a) The PDF and (b) the CDF of X ~ Exponential($\lambda$).


In MATLAB, the code used to generate Figure 4.20(a) is shown below. There are multiple ways of doing this. An alternative way is to call exppdf, which will return the same result. In Python, the corresponding command is stats.expon.pdf. Note that in Python the parameter is specified through the scale option, which is the mean $1/\lambda$ rather than the rate $\lambda$. MATLAB's pdf('exp',...) likewise takes the mean.

% MATLAB code to plot the exponential PDF
lambda1 = 1/2; lambda2 = 1/5;   % passed as the mean 1/lambda, so the rates are 2 and 5
x = linspace(0,1,1000);
f1 = pdf('exp',x, lambda1);
f2 = pdf('exp',x, lambda2);
plot(x, f1, 'LineWidth', 4); hold on;
plot(x, f2, 'LineWidth', 4);

# Python code to plot the exponential PDF
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
lambd1 = 1/2   # used as scale = 1/lambda, so the rate is 2
lambd2 = 1/5   # used as scale = 1/lambda, so the rate is 5
x = np.linspace(0,1,1000)
f1 = stats.expon.pdf(x,scale=lambd1)
f2 = stats.expon.pdf(x,scale=lambd2)
plt.plot(x, f1)
plt.plot(x, f2)
plt.show()


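To plot the CDF, we replace pdf by cdf. Similarly, in Python we replace expon.pdf by expon.cdf.

% MATLAB code to plot the exponential CDF
F = cdf('exp',x, lambda1);
plot(x, F, 'LineWidth', 4, 'Color', [0 0.2 0.8]);

# Python code to plot the exponential CDF
F = stats.expon.cdf(x,scale=lambd1)
plt.plot(x, F)

Theorem 4.9. If X ~ Exponential($\lambda$), then

$$\mathbb{E}[X] = \frac{1}{\lambda} \quad \text{and} \quad \mathrm{Var}[X] = \frac{1}{\lambda^2}.$$

Proof. We have discussed this proof before. Here is a recap for completeness:

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_{0}^{\infty} \lambda x e^{-\lambda x}\,dx = -x e^{-\lambda x}\Big|_{0}^{\infty} + \int_{0}^{\infty} e^{-\lambda x}\,dx = \frac{1}{\lambda},$$

$$\mathbb{E}[X^2] = \int_{0}^{\infty} \lambda x^2 e^{-\lambda x}\,dx = -x^2 e^{-\lambda x}\Big|_{0}^{\infty} + \int_{0}^{\infty} 2x e^{-\lambda x}\,dx = \frac{2}{\lambda^2}.$$

Thus, $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$.

Computing the mean and variance of an exponential random variable in MATLAB and Python follows a procedure similar to the one described above for the uniform random variable.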
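As a library-free sanity check of the mean and variance of an exponential random variable, the sketch below draws samples with the standard library's `random.expovariate`, which takes the rate $\lambda$ directly, and compares the empirical mean and variance with $1/\lambda$ and $1/\lambda^2$. The rate, the seed, and the sample size are arbitrary illustrative choices.

```python
import random

lam = 2.0                  # rate parameter (arbitrary choice)
rng = random.Random(0)     # fixed seed so the run is reproducible
n = 200_000
samples = [rng.expovariate(lam) for _ in range(n)]

emp_mean = sum(samples) / n
emp_var = sum((x - emp_mean) ** 2 for x in samples) / n
print(emp_mean, 1 / lam)         # empirical vs. theoretical mean
print(emp_var, 1 / lam ** 2)     # empirical vs. theoretical variance
```

The empirical values fluctuate from run to run when the seed is changed, but they concentrate around the theoretical values as the sample size grows.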


4.5.3 Origin of exponential random variables


Exponential random variables are closely related to Poisson random variables. Recall that 
the definition of a Poisson random variable is a random variable that describes the number 
of events that happen in a certain period, e.g., photon arrivals, number of pedestrians, phone 
calls, etc. We summarize the origin of an exponential random variable as follows. 


What is the origin of exponential random variables?

• An exponential random variable is the interarrival time between two consecutive Poisson events.

• That is, an exponential random variable is how much time it takes to go from N Poisson counts to N + 1 Poisson counts.


An example will clarify this concept. Imagine that you are waiting for a bus, as illustrated in Figure 4.21. Passengers arrive at the bus stop with an arrival rate $\lambda$ per unit time. Thus, for some time t, the average number of people that arrive is $\lambda t$. Let N be a random variable denoting the number of people. We assume that N is Poisson with a parameter $\lambda t$. That is, for any duration t, the probability of observing n people follows the PMF

$$\mathbb{P}[N = n] = \frac{(\lambda t)^n}{n!} e^{-\lambda t}.$$

Figure 4.21: For any fixed period of time t, the number of people N is modeled as a Poisson random variable with a parameter $\lambda t$.

Figure 4.22: The interarrival time T between two consecutive Poisson events is an exponential random variable.


Let T be the interarrival time between two people, by which we mean the time between 
two consecutive arrivals, as shown in Figure 4.22. Note that T is a random variable because 
T depends on N, which is itself a random variable. To find the PDF of T, we first find the
CDF of T. We note that 


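$$\mathbb{P}[T > t] \overset{(a)}{=} \mathbb{P}[\text{interarrival time} > t] \overset{(b)}{=} \mathbb{P}[\text{no arrival in } t] \overset{(c)}{=} \mathbb{P}[N = 0] = \frac{(\lambda t)^0}{0!} e^{-\lambda t} = e^{-\lambda t}.$$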


In this set of arguments, (a) holds because T is the interarrival time, and (b) holds be- 
cause interarrival time is between two consecutive arrivals. If the interarrival time is larger 
than t, there is no arrival during the period. Equality (c) holds because N is the number of 
passengers. 

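Since $\mathbb{P}[T > t] = 1 - F_T(t)$, where $F_T(t)$ is the CDF of $T$, we can show that

$$F_T(t) = 1 - e^{-\lambda t}, \qquad f_T(t) = \frac{d}{dt} F_T(t) = \lambda e^{-\lambda t}.$$

Therefore, the interarrival time $T$ follows an exponential distribution.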
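
The link between Poisson counts and exponential interarrival times can be checked numerically. The Python sketch below is an illustration, not part of the derivation: the rate `lam` and the window length `horizon` are arbitrary assumed values, and the simulation relies on the standard fact that, given the number of arrivals in a window, the arrival instants of a Poisson process are uniformly distributed over that window. It compares the empirical tail probability of the gaps with $e^{-\lambda t}$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0          # assumed arrival rate (arrivals per unit time)
horizon = 5000.0   # assumed length of the observation window

# The number of arrivals in the window is Poisson with parameter lam * horizon.
n_arrivals = rng.poisson(lam * horizon)

# Given the count, the arrival instants are uniform over the window.
arrival_times = np.sort(rng.uniform(0.0, horizon, size=n_arrivals))

# Interarrival times are the gaps between consecutive arrivals.
T = np.diff(arrival_times)

t = 0.5
empirical_tail = np.mean(T > t)        # estimate of P[T > t]
theoretical_tail = np.exp(-lam * t)    # e^{-lambda t}
print(empirical_tail, theoretical_tail)
print(T.mean(), 1 / lam)               # sample mean vs. 1/lambda
```

The simulation never draws from an exponential distribution directly, so the agreement between the two printed pairs is evidence for the argument rather than a restatement of it.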


Since exponential random variables are tightly connected to Poisson random variables, 
we should expect them to be useful for modeling temporal events. We discuss two examples. 


4.5.4 Applications of exponential random variables


Example 4.22. (Photon arrivals) Single-photon image sensors are designed to op- 
erate in the photon-limited regime. The number-one goal of using these sensors is to 
count the number of arriving photons precisely. However, for some applications not 
all single-photon image sensors are used to count photons. Some are used to measure 
the time between two photon arrivals, such as time-of-flight systems. In this case, we 
are interested in measuring the time it takes for a pulse to bounce back to the sensor. 
The more time it takes for a pulse to come back, the greater the distance between the 
object and the sensor. Other applications utilize the time information. For example, 
high-dynamic-range imaging can be achieved by recording the time between two photon
arrivals because brighter regions have a higher Poisson rate $\lambda$ and darker regions
have a lower $\lambda$.


[Figure: histograms of photon interarrival times, with panels "Low-light" and "High-light".]


The figure above illustrates an example of high-dynamic-range imaging. When the
scene is bright, the large $\lambda$ will generate more photons. Therefore, the interarrival time
between the consecutive photons will be relatively short. If we plot the histogram of 
the interarrival time, we observe that most of the interarrival time will be concentrated 
at small values. Dark regions behave in the opposite manner. The interarrival time will 
typically be much longer. In addition, because there is more variation in the photon 
arrival times, the histogram will look shorter and wider. Nevertheless, both cases are 
modeled by the exponential random variable. 


Example 4.23. (Energy-efficient escalator) Many airports today have installed variable 
speed escalators. These escalators change their speeds according to the traffic. If there 
are no passengers for more than a certain period (say, 60 seconds), the escalator will 
switch from the full-speed mode to the low-speed mode. For moderately busy esca- 
lators, the variable-speed configuration can save energy. The interesting data-science 
problem is to determine, given a traffic pattern, e.g., the one shown in Figure 4.23, 
whether we can predict the amount of energy savings? 

We will not dive into the details of this problem, but we can briefly discuss the
principle. Consider a fixed arrival rate $\lambda$ (say, the average from 07:00 to 08:00). The
interarrival time, according to our discussion above, follows an exponential distribution.
So we know that

$$f_T(t) = \lambda e^{-\lambda t}.$$


Suppose that the escalator switches to low-speed mode when the interarrival time
exceeds $\tau$. Then we can define a new variable $Y$ to denote the amount of time that
the escalator will operate in the low-speed mode. This new variable is


$$Y = \begin{cases} T - \tau, & T > \tau, \\ 0, & T \le \tau. \end{cases}$$


In other words, if the interarrival time $T$ is more than $\tau$, then the amount of time
saved $Y$ takes the value $T - \tau$, but if the interarrival time is less than $\tau$, then there is
no saving.


Figure 4.23: The variable-speed escalator problem. [Left] We model the passengers as independent
Poisson arrivals. Thus, the interarrival time is exponential. [Right] A hypothetical passenger arrival
rate (number of people per minute), from 06:00 to 23:00.


Figure 4.24: The escalator problem requires modeling the cutoff threshold $\tau$ such that if $T > \tau$,
the savings are $Y = T - \tau$. If $T \le \tau$, then $Y = 0$. The left-hand side of the figure shows how the
PDF of $Y$ is constructed.


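The PDF of $Y$ can be computed according to Figure 4.24. There are two parts
to the calculation. When $Y = 0$, there is a probability mass such that

$$f_Y(0) = \mathbb{P}[Y = 0] = \int_0^{\tau} f_T(t)\, dt = \int_0^{\tau} \lambda e^{-\lambda t}\, dt = 1 - e^{-\lambda \tau}.$$

For other values of $y$, we can show that

$$f_Y(y) = f_T(y + \tau) = \lambda e^{-\lambda (y + \tau)}.$$

Therefore, to summarize, we can show that the PDF of $Y$ is

$$f_Y(y) = \begin{cases} (1 - e^{-\lambda \tau})\, \delta(y), & y = 0, \\ \lambda e^{-\lambda (y + \tau)}, & y > 0. \end{cases}$$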




Consequently, we can compute $\mathbb{E}[Y]$ and $\mathrm{Var}[Y]$ and analyze how these values change
with $\lambda$ (which itself changes with the time of day). Furthermore, we can analyze the
amount of savings in terms of dollars. We leave these problems as an exercise.
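
As a numerical companion to the escalator example (a sketch under stated assumptions, not a solution to the exercise), the following Python code simulates the low-speed time $Y = \max(T - \tau, 0)$. The values `lam = 0.05` arrivals per second and `tau = 60` seconds are assumed purely for illustration. The code compares the fraction of samples with $Y = 0$ against the probability mass $1 - e^{-\lambda\tau}$, and the sample mean of $Y$ against $e^{-\lambda\tau}/\lambda$, which is what one obtains by integrating $y \cdot \lambda e^{-\lambda(y+\tau)}$ over $y > 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.05   # assumed arrival rate (passengers per second)
tau = 60.0   # assumed cutoff threshold (seconds)

# Interarrival times T ~ Exponential(lam); NumPy is parametrized by the scale 1/lam.
T = rng.exponential(scale=1 / lam, size=1_000_000)

# Time spent in low-speed mode: Y = T - tau if T > tau, and 0 otherwise.
Y = np.maximum(T - tau, 0.0)

p_zero_empirical = np.mean(Y == 0)
p_zero_theory = 1 - np.exp(-lam * tau)     # probability mass at y = 0

mean_empirical = Y.mean()
mean_theory = np.exp(-lam * tau) / lam     # integral of y * lam * exp(-lam * (y + tau)) over y > 0
print(p_zero_empirical, p_zero_theory)
print(mean_empirical, mean_theory)
```

Repeating the run for several values of `lam` gives a rough picture of how the expected low-speed time changes over the day.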


Closing remark. The photon arrival problem and the escalator problem are two of many 
examples we can find in which exponential random variables are useful for modeling a 
problem. We did not go into the details of the problems because each of them requires some 
additional modeling to address the real practical problem. We encourage you to explore these 
problems further. Our message is simple: Many problems can be modeled by exponential 
random variables, most of which are associated with time. 


4.6 Gaussian Random Variables 


We now discuss the most important continuous random variable — the Gaussian random 
variable (also known as the normal random variable). We call it the most important random 
variable because it is widely used in almost all scientific disciplines. Many of us have used 
Gaussian random variables before, and perhaps its bell shape is the first lesson we learn in 
statistics. However, there are many mysteries about Gaussian random variables which you 
may have missed, such as: Where does the Gaussian random variable come from? Why does 
it take a bell shape? What are the properties of a Gaussian random variable? The objective 
of this section is to explain everything you need to know about a Gaussian random variable. 


4.6.1 Definition of a Gaussian random variable 


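Definition 4.13. A Gaussian random variable is a random variable $X$ such that its
PDF is

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}, \tag{4.25}$$

where $(\mu, \sigma^2)$ are parameters of the distribution. We write

$$X \sim \text{Gaussian}(\mu, \sigma^2) \quad \text{or} \quad X \sim \mathcal{N}(\mu, \sigma^2)$$

to say that $X$ is drawn from a Gaussian distribution of parameter $(\mu, \sigma^2)$.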


Gaussian random variables have two parameters $(\mu, \sigma^2)$. It is noteworthy that the mean
is $\mu$ and the variance is $\sigma^2$; these two parameters are exactly the first moment and the
second central moment of the random variable. Most other random variables do not have
this property.

Note that the PDF of a Gaussian random variable is positive from $-\infty$ to $\infty$. Thus, $f_X(x)$ has
a non-zero value for any $x$, even though the value may be extremely small. A Gaussian
random variable is also symmetric about $\mu$. If $\mu = 0$, then $f_X(x)$ is an even function.

The shape of the Gaussian is illustrated in Figure 4.25. When we fix the variance and 
change the mean, the PDF of the Gaussian moves left or right depending on the sign of the 
mean. When we fix the mean and change the variance, the PDF of the Gaussian changes 
its width. Since any PDF should integrate to unity, a wider Gaussian means that the PDF
is shorter. Note also that if $\sigma$ is very small, it is possible that $f_X(x) > 1$ although the
integration over $x$ will still be 1.
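
A quick numerical illustration of this last remark, offered as a sketch: the value $\sigma = 0.1$ is an arbitrary assumption, and the helper `gaussian_pdf` is written here from the Gaussian PDF formula rather than taken from a library. The peak height is $1/\sqrt{2\pi\sigma^2}$, which exceeds 1 for small $\sigma$, while a trapezoidal integration of the PDF still returns a value very close to 1.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian PDF with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

mu, sigma = 0.0, 0.1                  # assumed values; sigma is chosen small on purpose
x = np.linspace(-1.0, 1.0, 20001)     # covers +/- 10 standard deviations
f = gaussian_pdf(x, mu, sigma)

peak = f.max()                                      # height of the PDF at x = mu
area = np.sum((f[1:] + f[:-1]) * np.diff(x)) / 2    # trapezoidal rule
print(peak, area)
```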


[Figure panels: left, $\mu$ changes with $\sigma = 1$; right, $\mu = 0$ with $\sigma$ changing.]


Figure 4.25: A Gaussian random variable with different $\mu$ and $\sigma$.


On a computer, plotting the Gaussian PDF can be done by calling the function
pdf('norm',x) in MATLAB, and stats.norm.pdf in Python.


% MATLAB to generate a Gaussian PDF
x = linspace(-10,10,1000);
mu = 0; sigma = 1;
f = pdf('norm',x,mu,sigma);
plot(x, f);


# Python to generate a Gaussian PDF
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.linspace(-10,10,1000)
mu = 0; sigma = 1;
f = stats.norm.pdf(x,mu,sigma)
plt.plot(x,f)


Our next result concerns the mean and variance of a Gaussian random variable. You
may wonder why we need this theorem when we already know that $\mu$ is the mean and $\sigma^2$ is
the variance. The answer is that we have not proven these two facts.


Theorem 4.10. If $X \sim \text{Gaussian}(\mu, \sigma^2)$, then

$$\mathbb{E}[X] = \mu \quad \text{and} \quad \mathrm{Var}[X] = \sigma^2.$$
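
Proof. The expectation can be derived via substitution:

$$\begin{aligned}
\mathbb{E}[X] &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} x\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx \\
&\overset{(a)}{=} \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} (y + \mu)\, e^{-\frac{y^2}{2\sigma^2}}\, dy \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} y\, e^{-\frac{y^2}{2\sigma^2}}\, dy + \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \mu\, e^{-\frac{y^2}{2\sigma^2}}\, dy \\
&\overset{(b)}{=} 0 + \mu \left( \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2\sigma^2}}\, dy \right) \\
&\overset{(c)}{=} \mu,
\end{aligned}$$

where in (a) we substitute $y = x - \mu$, in (b) we use the fact that the first integrand is odd
so that the integration is 0, and in (c) we observe that integration over the entire sample
space of the PDF yields 1.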


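The variance is also derived by substitution.

$$\begin{aligned}
\mathrm{Var}[X] &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} (x - \mu)^2 e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx \\
&\overset{(a)}{=} \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y^2 e^{-\frac{y^2}{2}}\, dy \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \Big[ -y\, e^{-\frac{y^2}{2}} \Big]_{-\infty}^{\infty} + \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\, dy \\
&= 0 + \sigma^2 \left( \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\, dy \right) = \sigma^2,
\end{aligned}$$

where in (a) we substitute $y = (x - \mu)/\sigma$.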


4.6.2 Standard Gaussian 


We need to evaluate the probability $\mathbb{P}[a \le X \le b]$ of a Gaussian random variable $X$ in many
practical situations. This involves the integration of the Gaussian PDF, i.e., determining the
CDF. Unfortunately, there is no closed-form expression of $\mathbb{P}[a \le X \le b]$ in terms of $(\mu, \sigma^2)$.
This leads to what we call the standard Gaussian.


Definition 4.14. The standard Gaussian (or standard normal) random variable $X$
has a PDF

$$f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}. \tag{4.27}$$

That is, $X \sim \mathcal{N}(0, 1)$ is a Gaussian with $\mu = 0$ and $\sigma^2 = 1$.


The CDF of the standard Gaussian can be determined by integrating the PDF. We have a 
special notation for this CDF. Figure 4.26 illustrates the idea. 

Definition 4.15. The CDF of the standard Gaussian is defined as the $\Phi(\cdot)$ function

$$\Phi(x) \overset{\text{def}}{=} F_X(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\, dt. \tag{4.28}$$


Figure 4.26: Definition of the CDF of the standard Gaussian $\Phi(x)$.


% MATLAB code to generate standard Gaussian PDF and CDF
x = linspace(-5,5,1000);
f = normpdf(x,0,1);
F = normcdf(x,0,1);
figure; plot(x, f);
figure; plot(x, F);


# Python code to generate standard Gaussian PDF and CDF
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.linspace(-10,10,1000)
f = stats.norm.pdf(x)
F = stats.norm.cdf(x)
plt.plot(x,f); plt.show()
plt.plot(x,F); plt.show()


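The standard Gaussian's CDF is related to a so-called error function defined as

$$\text{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt. \tag{4.29}$$

It is easy to link $\Phi(x)$ with $\text{erf}(x)$:

$$\Phi(x) = \frac{1}{2} \left[ 1 + \text{erf}\left( \frac{x}{\sqrt{2}} \right) \right], \quad \text{and} \quad \text{erf}(x) = 2\Phi(x\sqrt{2}) - 1.$$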
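
The two relations between $\Phi$ and erf can be checked numerically. The sketch below is illustrative only: `Phi_numeric` is a hypothetical helper that integrates the standard Gaussian PDF with the trapezoidal rule, starting from an assumed lower limit of $-10$ in place of $-\infty$, while `Phi_erf` uses `math.erf` from the Python standard library.

```python
import math
import numpy as np

def Phi_numeric(x, lower=-10.0, n=200001):
    """Standard Gaussian CDF by integrating the PDF from `lower` to x (trapezoidal rule)."""
    t = np.linspace(lower, x, n)
    f = np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)
    return np.sum((f[1:] + f[:-1]) * np.diff(t)) / 2

def Phi_erf(x):
    """Standard Gaussian CDF written with the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# First relation: Phi(x) = (1/2) * [1 + erf(x / sqrt(2))]
for x in [-1.5, 0.0, 0.7, 2.0]:
    print(x, Phi_numeric(x), Phi_erf(x))

# Second relation: erf(x) = 2 * Phi(x * sqrt(2)) - 1
x0 = 0.3
print(math.erf(x0), 2 * Phi_numeric(x0 * math.sqrt(2)) - 1)
```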


With the standard Gaussian CDF, we can define the CDF of an arbitrary Gaussian. 

Theorem 4.11 (CDF of an arbitrary Gaussian). Let $X \sim \mathcal{N}(\mu, \sigma^2)$. Then

$$F_X(x) = \Phi\left( \frac{x - \mu}{\sigma} \right). \tag{4.30}$$


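Proof. We start by expressing $F_X(x)$:

$$F_X(x) = \mathbb{P}[X \le x] = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(t-\mu)^2}{2\sigma^2}}\, dt.$$

Substituting $y = \frac{t - \mu}{\sigma}$, and using the definition of standard Gaussian, we have

$$\int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(t-\mu)^2}{2\sigma^2}}\, dt = \int_{-\infty}^{\frac{x-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}}\, dy = \Phi\left( \frac{x - \mu}{\sigma} \right).$$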


If you would like to verify this on a computer, you can try the following code. 


% MATLAB code to verify standardized Gaussian
x = linspace(-5,5,1000);
mu = 3; sigma = 2;
F1 = normcdf((x-mu)/sigma,0,1); % standardized
F2 = normcdf(x, mu, sigma); % raw
max(abs(F1-F2)) % close to 0, as Theorem 4.11 predicts


# Python code to verify standardized Gaussian
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.linspace(-5,5,1000)
mu = 3; sigma = 2;
F1 = stats.norm.cdf((x-mu)/sigma,0,1)  # standardized
F2 = stats.norm.cdf(x,mu,sigma)        # raw
print(np.max(np.abs(F1-F2)))           # close to 0, as Theorem 4.11 predicts


An immediate consequence of this result is that 


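$$\mathbb{P}[a < X \le b] = \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right). \tag{4.31}$$


To see this, note that

$$\mathbb{P}[a < X \le b] = \mathbb{P}[X \le b] - \mathbb{P}[X \le a] = \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right).$$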


The inequality signs of the two end points are not important. That is, the statement also
holds for $\mathbb{P}[a \le X \le b]$ or $\mathbb{P}[a < X < b]$, because $X$ is a continuous random variable at
every $x$. Thus, $\mathbb{P}[X = a] = \mathbb{P}[X = b] = 0$ for any $a$ and $b$. Besides this, $\Phi$ has several
properties of interest. See if you can prove these:

Corollary 4.1. Let $X \sim \mathcal{N}(\mu, \sigma^2)$. Then the following results hold:


• $\Phi(y) = 1 - \Phi(-y)$.
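
The interval formula $\mathbb{P}[a < X \le b] = \Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right)$ and the symmetry $\Phi(y) = 1 - \Phi(-y)$ can both be checked with a few lines of Python. This is a sketch under assumed values: $\mu = 3$, $\sigma = 2$, the interval end points, and the test point `y` are all arbitrary, and `Phi` is a small helper built on `math.erf`.

```python
import math
import numpy as np

def Phi(x):
    """Standard Gaussian CDF, written with the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

mu, sigma = 3.0, 2.0     # assumed parameters of X
a, b = 1.0, 6.0          # assumed interval end points

# Probability from the standard Gaussian CDF.
p_formula = Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

# Monte Carlo estimate from samples of X ~ N(mu, sigma^2).
rng = np.random.default_rng(2)
X = rng.normal(mu, sigma, size=1_000_000)
p_sampled = np.mean((X > a) & (X <= b))
print(p_formula, p_sampled)

# Symmetry property: Phi(y) = 1 - Phi(-y).
y = 0.8
print(Phi(y), 1 - Phi(-y))
```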


4.6.3 Skewness and kurtosis

In modern data analysis we are sometimes interested in high-order moments. Here we
consider two useful quantities: skewness and kurtosis.


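Definition 4.16. For a random variable $X$ with PDF $f_X(x)$, define the following
central moments as

$$\begin{aligned}
\text{mean} &= \mathbb{E}[X] \overset{\text{def}}{=} \mu, \\
\text{variance} &= \mathbb{E}\left[ (X - \mu)^2 \right] \overset{\text{def}}{=} \sigma^2, \\
\text{skewness} &= \mathbb{E}\left[ \left( \frac{X - \mu}{\sigma} \right)^3 \right] \overset{\text{def}}{=} \gamma, \\
\text{kurtosis} &= \mathbb{E}\left[ \left( \frac{X - \mu}{\sigma} \right)^4 \right] \overset{\text{def}}{=} \kappa, \qquad \text{excess kurtosis} \overset{\text{def}}{=} \kappa - 3.
\end{aligned}$$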


As you can see from the definitions above, skewness is the third central moment,
whereas kurtosis is the fourth central moment, each normalized by the corresponding power
of $\sigma$. Both skewness and kurtosis can be regarded as “deviations” from a standard Gaussian,
not in terms of mean and variance but in terms of shape.

Skewness measures the asymmetry of the distribution. Figure 4.27 shows three different
distributions: one with positive skewness, one with negative skewness, and one symmetric.
The skewness of a curve is


• Leaning towards the left (longer tail on the right): positive
• Leaning towards the right (longer tail on the left): negative


• Symmetric: zero


What is skewness?


• $\gamma = \mathbb{E}\left[ \left( \frac{X - \mu}{\sigma} \right)^3 \right]$.


• Measures the asymmetry of the distribution.


• Gaussian has skewness 0.

[Figure legend: positive skewness, symmetric, negative skewness.]


Figure 4.27: Skewness of a distribution measures the asymmetry of the distribution. In this example
the skewnesses are: orange = 0.8943, black = 0, blue = -1.414.


Kurtosis measures how heavy-tailed the distribution is. There are two forms of kurtosis:
one is the standard kurtosis, which is the fourth central moment (normalized by $\sigma^4$), and the
other is the excess kurtosis, which is $\kappa_{\text{excess}} = \kappa - 3$. The constant 3 comes from the
kurtosis of a standard Gaussian. Excess kurtosis is more widely used in data analysis. The
interpretation of kurtosis is the comparison to a Gaussian. If the excess kurtosis is positive,
the distribution has a tail that decays more slowly than a Gaussian (a heavier tail). If the
excess kurtosis is negative, the distribution has a tail that decays faster than a Gaussian
(a lighter tail). Figure 4.28 illustrates the (excess) kurtosis of three different distributions.


[Figure legend: kurtosis > 0, kurtosis = 0, kurtosis < 0.]


Figure 4.28: Kurtosis of a distribution measures how heavy-tailed the distribution is. In this example,
the (excess) kurtoses are: orange = 2.8567, black = 0, blue = -0.1242.


What is kurtosis?


• $\kappa = \mathbb{E}\left[ \left( \frac{X - \mu}{\sigma} \right)^4 \right]$.

• Measures how heavy-tailed the distribution is. Gaussian has kurtosis 3.


• Some statisticians prefer excess kurtosis $\kappa - 3$, so that Gaussian has excess
kurtosis 0.
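
Random variable   Mean              Variance               Skewness                         Excess kurtosis
                  $\mu$             $\sigma^2$             $\gamma$                         $\kappa - 3$
Bernoulli         $p$               $p(1-p)$               $\frac{1-2p}{\sqrt{p(1-p)}}$     $\frac{1}{1-p} + \frac{1}{p} - 6$
Binomial          $np$              $np(1-p)$              $\frac{1-2p}{\sqrt{np(1-p)}}$    $\frac{6p^2 - 6p + 1}{np(1-p)}$
Geometric         $\frac{1}{p}$     $\frac{1-p}{p^2}$      $\frac{2-p}{\sqrt{1-p}}$         $\frac{p^2 - 6p + 6}{1-p}$
Poisson           $\lambda$         $\lambda$              $\frac{1}{\sqrt{\lambda}}$       $\frac{1}{\lambda}$
Uniform           $\frac{a+b}{2}$   $\frac{(b-a)^2}{12}$   $0$                              $-\frac{6}{5}$
Exponential       $\frac{1}{\lambda}$   $\frac{1}{\lambda^2}$   $2$                         $6$
Gaussian          $\mu$             $\sigma^2$             $0$                              $0$


Table 4.1: The first few moments of commonly used random variables.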


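On a computer, computing the empirical skewness and kurtosis is done by built-in
commands. Their implementations are based on the finite-sample calculations

$$\gamma \approx \frac{1}{N} \sum_{n=1}^{N} \left( \frac{X_n - \hat{\mu}}{\hat{\sigma}} \right)^3, \qquad \kappa \approx \frac{1}{N} \sum_{n=1}^{N} \left( \frac{X_n - \hat{\mu}}{\hat{\sigma}} \right)^4,$$

where $\hat{\mu}$ and $\hat{\sigma}$ denote the sample mean and the sample standard deviation.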

The MATLAB and Python built-in commands are shown below, using a gamma distribution 
as an example. 


% MATLAB code to compute skewness and kurtosis
X = random('gamma',3,5,[10000,1]);
s = skewness(X);
k = kurtosis(X); % standard kurtosis (a Gaussian gives 3)

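# Python code to compute skewness and kurtosis
import scipy.stats as stats
X = stats.gamma.rvs(3,scale=5,size=10000)  # shape 3, scale 5 (a second positional argument would be loc)
s = stats.skew(X)
k = stats.kurtosis(X)                      # excess kurtosis by default (a Gaussian gives 0)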


Example 4.24. To further illustrate the behavior of skewness and kurtosis, we consider 
an example using the gamma random variable X. The PDF of X is given by the 
equation 


$$f_X(x) = \frac{1}{\Gamma(k)\theta^k} x^{k-1} e^{-\frac{x}{\theta}}, \quad x \ge 0, \tag{4.32}$$


where $\Gamma(\cdot)$ is known as the gamma function. If $k$ is an integer, the gamma function is
just the factorial: $\Gamma(k) = (k - 1)!$. A gamma random variable is parametrized by two
parameters $(k, \theta)$. As $k$ increases or decreases, the shape of the PDF will change. For
example, when $k = 1$, the distribution is simplified to an exponential distribution.

Without going through the (tedious) integration, we can show that the skewness
and the (excess) kurtosis of Gamma$(k, \theta)$ are

$$\text{skewness} = \frac{2}{\sqrt{k}}, \qquad \text{(excess) kurtosis} = \frac{6}{k}.$$


As we can see from these results, the skewness and kurtosis diminish as $k$ grows. This
can be confirmed from the PDF of Gamma$(k, \theta)$ as shown in Figure 4.29.
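
The decay of the skewness $2/\sqrt{k}$ and the excess kurtosis $6/k$ can also be observed empirically. The sketch below is illustrative: it computes both quantities by averaging powers of the standardized samples, and the shape values `k = 1, 4, 16`, the scale `theta = 1`, and the sample size are assumed choices.

```python
import numpy as np

def sample_skewness(x):
    """Average of the cubed standardized samples."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

def sample_excess_kurtosis(x):
    """Average of the fourth power of the standardized samples, minus 3."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3

rng = np.random.default_rng(4)
theta = 1.0       # assumed scale parameter
n = 1_000_000     # assumed number of samples

results = {}
for k in [1, 4, 16]:
    X = rng.gamma(shape=k, scale=theta, size=n)
    results[k] = (sample_skewness(X), sample_excess_kurtosis(X))
    # empirical value next to the closed-form value
    print(k, results[k][0], 2 / np.sqrt(k), results[k][1], 6 / k)
```

Estimates of the kurtosis fluctuate more than estimates of the skewness because they depend on higher powers of the samples, so a large sample size helps.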


Figure 4.29: The PDF of a gamma distribution Gamma$(k, \theta)$, where $\theta = 1$. The skewness and
the kurtosis are decaying to zero.


Example 4.25. Let us look at a real example. On April 15, 1912, RMS Titanic sank 
after hitting an iceberg. The disaster killed 1502 out of 2224 passengers and crew. A 
hundred years later, we want to analyze the data. At https://www.kaggle.com/c/titanic/
there is a dataset collecting the identities, age, gender, etc., of the passengers.
We partition the dataset into two: one for those who died and the other one for those 
who survived. We plot the histograms of the ages of the two groups and compute 
several statistics of the dataset. Figure 4.30 shows the two datasets. 

[Figure: histograms of age for Group 1 (died) and Group 2 (survived).]


Figure 4.30: The Titanic dataset https://www.kaggle.com/c/titanic/.

Statistics            Group 1 (Died)    Group 2 (Survived)
Mean                  30.6262           28.3437
Standard Deviation    14.1721           14.9510
Skewness              0.5835            0.1795
Excess Kurtosis       0.2652            -0.0772


Note that the two groups of people have very similar means and standard deviations.
In other words, if we only compare the mean and standard deviation, it is nearly
impossible to differentiate the two groups. However, the skewness and kurtosis provide
more information related to the shape of the histograms. For example, Group 1 has
more positive skewness, whereas Group 2 is almost symmetrical. One interpretation is
that more young people offered lifeboats to children and older people. The kurtosis of
Group 1 is slightly positive, whereas that of Group 2 is slightly negative. Therefore,
high-order moments can sometimes be useful for data analysis.


4.6.4 Origin of Gaussian random variables 


The Gaussian random variable has a long history. Here, we provide one perspective on why 
Gaussian random variables are so useful. We give some intuitive arguments but leave the 
formal mathematical treatment for later when we introduce the Central Limit Theorem. 

Let’s begin with a numerical experiment. Consider throwing a fair die. We know that 
this will give us a (discrete) uniform random variable X. If we repeat the experiment many 
times we can plot the histogram, and it will return us a plot of 6 impulses with equal height, 
as shown in Figure 4.31(a). 

Now, suppose we throw two dice. Call them $X_1$ and $X_2$, and let $Z = X_1 + X_2$, i.e.,
the sum of two dice. We want to find the distribution of Z. To do so, we first list out all 
the possible outcomes in the sample space; this gives us {(1,1), (1,2),...,(6,6)}. We then 
sum the numbers, which gives us a list of states of Z: {2,3,4,...,12}. The probability of 
getting these states is shown in Figure 4.31(b), which has a triangular shape. The triangular 
shape makes sense because to get the state “2”, we must have the pair (1,1), which is quite 
unlikely. However, if we want to get the state 7, it would be much easier to get a pair, e.g., 
(6, 1), (5, 2), (4, 3), (3, 4), (2,5), (1,6) would all do the job. 

Now, what will happen if we throw 5 dice and consider $Z = X_1 + X_2 + \cdots + X_5$? It turns
out that the distribution will continue to evolve and give something like Figure 4.31(c).
This is starting to approximate a bell shape. Finally, if we throw 100 dice and consider
$Z = X_1 + X_2 + \cdots + X_{100}$, the distribution will look like Figure 4.31(d). The shape is
becoming a Gaussian! This numerical example demonstrates a fascinating phenomenon: As
we sum more random variables, the distribution of the sum will converge to a Gaussian.

If you are curious about how we plot the above figures, the following MATLAB and 
Python code can be useful. 


% MATLAB code to show the histogram of Z = X1+X2+X3
N = 10000;
X1 = randi(6,1,N);
X2 = randi(6,1,N);
X3 = randi(6,1,N);
Z = X1 + X2 + X3;
histogram(Z, 2.5:18.5);


# Python code to show the histogram of Z = X1+X2+X3
import numpy as np
import matplotlib.pyplot as plt
N = 10000
X1 = np.random.randint(1,7,size=N)   # upper bound is exclusive, so 7 gives faces 1..6
X2 = np.random.randint(1,7,size=N)
X3 = np.random.randint(1,7,size=N)
Z = X1 + X2 + X3
plt.hist(Z,bins=np.arange(2.5,19.5)) # bin edges 2.5, 3.5, ..., 18.5


[Figure panels: (a) $X_1$, (b) $X_1 + X_2$, (c) $X_1 + \cdots + X_5$, (d) $X_1 + \cdots + X_{100}$.]


Figure 4.31: When adding uniform random variables, the overall distribution approaches a Gaussian as
the number of summed variables increases.


Can we provide a more formal description of this? Yes, but we need some new mathe- 
matical tools that we have not yet developed. So, for the time being, we will outline the flow 
of the arguments and leave the technical details to a later chapter. Suppose we have two 
independent random variables with identical distributions, e.g., $X_1$ and $X_2$, where both are
uniform. This gives us PDFs $f_{X_1}(x)$ and $f_{X_2}(x)$ that are two identical rectangular functions.
By what operation can we combine these two rectangular functions and create a triangle
function? The key lies in the concept of convolution. If you convolve two rectangle functions,
you will get a triangle function. Here we define the convolution of $f_X$ with itself as


$$(f_X * f_X)(x) = \int_{-\infty}^{\infty} f_X(\tau) f_X(x - \tau)\, d\tau.$$


In fact, for any pair of independent random variables $X_1$ and $X_2$ (not necessarily uniform
random variables), the sum $Z = X_1 + X_2$ will have a PDF given by the convolution of the two
PDFs. We have not yet proven this, but if you trust what we are saying, we can effectively
generalize this argument to many random variables. If we have $N$ random variables, then the
sum $Z = X_1 + X_2 + \cdots + X_N$ will have a PDF that is the result of $N$ convolutions of all the
individual PDFs.


What is the PDF of $X + Y$?


• Summing $X + Y$ is equivalent to convolving the PDFs $f_X * f_Y$.


• If you sum many random variables, you convolve all their PDFs.
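
For dice, the convolution can be carried out exactly on the PMFs. The sketch below illustrates the idea with the discrete analogue of the PDF convolution; it is not the formal argument. It uses `np.convolve` to build the PMF of the sum of several fair dice, and then compares the PMF of 100 dice with a Gaussian curve whose mean $3.5N$ and variance $\frac{35}{12}N$ match those of the sum. The helper name `pmf_of_sum` is introduced here for illustration.

```python
import numpy as np

die = np.ones(6) / 6            # PMF of one fair die on the faces 1..6

def pmf_of_sum(num_dice):
    """PMF of the sum of `num_dice` fair dice, by repeated convolution."""
    pmf = die.copy()
    for _ in range(num_dice - 1):
        pmf = np.convolve(pmf, die)
    values = np.arange(num_dice, 6 * num_dice + 1)   # possible sums
    return values, pmf

# Two dice: the counts out of 36 form the triangular shape.
values2, pmf2 = pmf_of_sum(2)
counts2 = np.round(pmf2 * 36).astype(int).tolist()
print(values2.tolist())
print(counts2)

# One hundred dice: compare with a Gaussian of matching mean and variance.
n = 100
values, pmf = pmf_of_sum(n)
mean, var = 3.5 * n, 35 / 12 * n
gauss = np.exp(-(values - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
max_gap = np.max(np.abs(pmf - gauss))
print(max_gap)
```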


How do we analyze these convolutions? We need a second set of tools related to Fourier 
transforms. The Fourier transform of a PDF is known as the characteristic function, which 
we will discuss later, but the name is not important now. What matters is the important
property of the Fourier transform, that a convolution in the original space is multiplication
in the Fourier space. That is,


$$\mathcal{F}\{ f_X * f_X * \cdots * f_X \} = \mathcal{F}\{f_X\} \cdot \mathcal{F}\{f_X\} \cdots \mathcal{F}\{f_X\}.$$


Multiplication in the Fourier space is much easier to analyze. In particular, for independent 
and identically distributed random variables, the multiplication will easily translate to ad- 
dition in the exponent. Then, by truncating the exponent to the second order, we can show 
that the limiting object in the Fourier space is approaching a Gaussian. Finally, since the 
inverse Fourier transform of a Gaussian remains a Gaussian, we have shown that the infinite 
convolution will give us a Gaussian. 


Here is some numerical evidence for what we have just described. Recall that the 
Fourier transform of a rectangle function is the sinc function. Therefore, if we have an 
infinite convolution of rectangular functions, equivalently, we have an infinite product of sinc 
functions in the Fourier space. Multiplying sinc functions is reasonably easy. See Figure 4.32 
for the first three sincs. It is evident that with just three sinc functions, the shape closely 
approximates a Gaussian. 
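
The claim about products of sinc functions can be examined numerically. The following sketch rests on one assumption that is not in the text: since $\frac{\sin x}{x} \approx 1 - \frac{x^2}{6}$ near the origin, the Gaussian used for comparison with the product of $n$ sinc functions is taken to be $e^{-n x^2/6}$, which matches the product to second order. The code reports the largest gap between the two curves for $n = 1, 2, 3$.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
sinc = np.sinc(x / np.pi)        # np.sinc(u) = sin(pi u) / (pi u), so this equals sin(x) / x

gaps = {}
for n in [1, 2, 3]:
    product = sinc ** n                      # product of n identical sinc functions
    gaussian = np.exp(-n * x ** 2 / 6)       # assumed second-order Gaussian match
    gaps[n] = np.max(np.abs(product - gaussian))
    print(n, gaps[n])
```

The printed gaps shrink as $n$ grows, in line with the visual impression that the product approaches a bell shape.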


[Figure legend: $(\sin x)/x$, $(\sin x)^2/x^2$, $(\sin x)^3/x^3$.]


Figure 4.32: Convolving the PDF of a uniform distribution is equivalent to multiplying their Fourier
transforms in the Fourier space. As the number of convolutions grows, the product is gradually becoming
Gaussian.


How about distributions that are not rectangular? We invite you to numerically visu- 
alize the effect when you convolve the function many times. You will see that as the number 
of convolutions grows, the resulting function will become more and more like a Gaussian. 
Regardless of what the input random variables are, as long as you add them, the sum will 
have a distribution that looks like a Gaussian: 


$$X_1 + X_2 + \cdots + X_N \sim \text{Gaussian}.$$


We use the notation ~ to emphasize that the convergence is not the usual form of conver- 
gence. We will make this precise later. 


The implication of this line of discussion is important. Regardless of the underlying 
true physical process, if we are only interested in the sum (or average), the distribution 
will be more or less Gaussian. In most engineering problems, we are looking at the sum 
or average. For example, when generating an image using an image sensor, the sensor will
add a certain amount of read noise. Read noise is caused by the random fluctuation of the 
electrons in the transistors due to thermal distortions. For high-photon-flux situations, we 
are typically interested in the average read noise rather than the electron-level read noise. 
Thus Gaussian random variables become a reasonable model for that. In other applications, 
such as imaging through a turbulent medium, the random phase distortions (which alter 
the phase of the wavefront) can also be modeled as a Gaussian random variable. Here is the 
summary of the origin of a Gaussian random variable: 


What is the origin of Gaussian?

• When we sum many independent random variables, the resulting random variable
is approximately a Gaussian.


• This is known as the Central Limit Theorem. The theorem applies to a very broad
class of random variables (those with finite mean and variance).


• Summing random variables is equivalent to convolving the PDFs. Convolving
PDFs infinitely many times yields the bell shape.


4.7 Functions of Random Variables 


One common question we encounter in practice is the transformation of random variables. 
The question can be summarized as follows: Given a random variable $X$ with PDF $f_X(x)$
and CDF $F_X(x)$, and supposing that $Y = g(X)$ for some function $g$, what are $f_Y(y)$ and
$F_Y(y)$? This is a prevalent question. For example, we measure the voltage $V$, and we want
to analyze the power $P = V^2/R$. This involves taking the square of a random variable.
Another example: We know the distribution of the phase $\Theta$, but we want to analyze the
signal $\cos(\omega t + \Theta)$. This involves a cosine transformation. How do we convert one variable
to another? Answering this question is the goal of this section. 


4.7.1 General principle 


We will first outline the general principle for tackling this type of problem. In the following 
subsection, we will give a few concrete examples. 

Suppose we are given a random variable $X$ with PDF $f_X(x)$ and CDF $F_X(x)$. Let $Y =
g(X)$ for some known and fixed function $g$. For simplicity, we assume that $g$ is monotonically
increasing. In this case, the CDF of $Y$ can be determined as follows.

$$F_Y(y) \overset{(a)}{=} \mathbb{P}[Y \le y] \overset{(b)}{=} \mathbb{P}[g(X) \le y] \overset{(c)}{=} \mathbb{P}[X \le g^{-1}(y)] \overset{(d)}{=} F_X(g^{-1}(y)).$$

This sequence of steps is not difficult to understand. Step (a) is the definition of CDF. Step
(b) substitutes $g(X)$ for $Y$. Step (c) uses the fact that since $g$ is invertible and increasing, we
can apply the inverse of $g$ to both sides of $g(X) \le y$ to yield $X \le g^{-1}(y)$. Step (d) is the
definition of the CDF, but this time applied to $\mathbb{P}[X \le x'] = F_X(x')$, for some $x'$.

It will be useful to visualize the situation in Figure 4.33. Here, we consider a uniformly
distributed $X$ so that the CDF $F_X(x)$ is a straight line. According to $F_X$, any samples
drawn according to $F_X$ are equally likely, as illustrated by the yellow dots on the $x$-axis.
As we transform the $X$'s through $Y = g(X)$, we increase/decrease the spacing between
two samples. Therefore, some samples become more concentrated while some become less
concentrated. The distribution of these transformed samples (the yellow dots on the $y$-axis)
forms a new CDF $F_Y(y)$. The result $F_Y(y) = F_X(g^{-1}(y))$ holds when we look at $Y$. The
samples are traveling with $g^{-1}$ in order to go back to $F_X$. Therefore, we need $g^{-1}$ in the
formula.


[Figure: samples on the $x$-axis, distributed according to $F_X(x)$, are mapped through $g$ to samples
on the $y$-axis, distributed according to $F_Y(y)$.]


Figure 4.33: When transforming a random variable $X$ to $Y = g(X)$, the distributions are defined
according to the spacing between samples. In this figure, a uniformly distributed $X$ will become squeezed
by some parts of $g$ and widened in other parts of $g$.


Why should we use the CDF and not the PDF in Figure 4.33? The advantage of the 
CDF is that it is an increasing function. Therefore, no matter what the function g is, the 
input and the output functions will still be increasing. If we use the PDF, then the non- 
monotonic behavior of the PDF will interact with another nonlinear function g. It becomes 
much harder to decouple the two. 

We can carry out the integrations to determine $F_X(g^{-1}(y))$. It can be shown that

$$F_X(g^{-1}(y)) = \int_{-\infty}^{g^{-1}(y)} f_X(x')\, dx', \tag{4.33}$$

and hence, by the fundamental theorem of calculus, we have

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} F_X(g^{-1}(y)) = \frac{d}{dy} \int_{-\infty}^{g^{-1}(y)} f_X(x')\, dx' = \left( \frac{d\, g^{-1}(y)}{dy} \right) \cdot f_X(g^{-1}(y)), \tag{4.34}$$

where the last step is due to the chain rule. Based on this line of reasoning we can summarize
a “recipe” for this problem.


How to find the PDF of $Y = g(X)$

• Step 1: Find the CDF $F_Y(y)$, which is $F_Y(y) = F_X(g^{-1}(y))$.


• Step 2: Find the PDF $f_Y(y)$, which is $f_Y(y) = \left( \frac{d\, g^{-1}(y)}{dy} \right) \cdot f_X(g^{-1}(y))$.


This recipe works when $g$ is a one-to-one mapping. If $g$ is not one-to-one, e.g., $g(x) = x^2$
implies $g^{-1}(y) = \pm\sqrt{y}$, then we will have some issues with the above two steps. When this
happens, then instead of writing $X \le g^{-1}(y)$ we need to determine the set $\{x \mid g(x) \le y\}$.
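
The two-step recipe can be checked against simulated samples. The sketch below uses an assumed example that does not appear in the text: $X \sim \text{Uniform}(0, 1)$ and $g(x) = e^x$, which is increasing and one-to-one. Then $g^{-1}(y) = \log y$, so Step 1 gives $F_Y(y) = \log y$ and Step 2 gives $f_Y(y) = 1/y$ for $1 \le y \le e$.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(0.0, 1.0, size=1_000_000)   # X ~ Uniform(0, 1), so F_X(x) = x on [0, 1]
Y = np.exp(X)                                # Y = g(X) with g(x) = e^x

def F_Y(y):
    """Step 1: F_Y(y) = F_X(g^{-1}(y)) = log(y) for 1 <= y <= e."""
    return np.log(y)

def f_Y(y):
    """Step 2: f_Y(y) = (d/dy log y) * f_X(log y) = 1 / y for 1 <= y <= e."""
    return 1.0 / y

# Compare the empirical CDF of the transformed samples with Step 1.
cdf_gaps = []
for y in [1.2, 1.8, 2.5]:
    cdf_gaps.append(abs(np.mean(Y <= y) - F_Y(y)))
    print(y, np.mean(Y <= y), F_Y(y))

# Compare a normalized histogram of the samples with Step 2.
hist, edges = np.histogram(Y, bins=50, range=(1.0, np.e), density=True)
centers = (edges[:-1] + edges[1:]) / 2
max_pdf_gap = np.max(np.abs(hist - f_Y(centers)))
print(max_pdf_gap)
```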


4.7.2 Examples 


Example 4.26. (Linear transform) Let $X$ be a random variable with PDF $f_X(x)$ and
CDF $F_X(x)$. Let $Y = 2X + 3$. Find $f_Y(y)$ and $F_Y(y)$. Express the answers in terms of
$f_X(x)$ and $F_X(x)$.


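Solution. We first note that

$$F_Y(y) = \mathbb{P}[Y \le y] = \mathbb{P}[2X + 3 \le y] = \mathbb{P}\left[ X \le \frac{y - 3}{2} \right] = F_X\left( \frac{y - 3}{2} \right).$$

Therefore, the PDF is

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} F_X\left( \frac{y - 3}{2} \right) = F_X'\left( \frac{y - 3}{2} \right) \frac{d}{dy} \left( \frac{y - 3}{2} \right) = \frac{1}{2} f_X\left( \frac{y - 3}{2} \right).$$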

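Follow-Up. (Linear transformation of a Gaussian random variable). Suppose $X$ is a Gaussian
random variable with zero mean and unit variance, and let $Y = aX + b$ (with $a > 0$). Then the
CDF and PDF of $Y$ are respectively

$$F_Y(y) = F_X\left( \frac{y - b}{a} \right) = \Phi\left( \frac{y - b}{a} \right), \qquad f_Y(y) = \frac{1}{a} f_X\left( \frac{y - b}{a} \right) = \frac{1}{\sqrt{2\pi a^2}} e^{-\frac{(y - b)^2}{2a^2}}.$$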


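Follow-Up. (Linear transformation of an exponential random variable). Suppose $X$ is an
exponential random variable with parameter $\lambda$, and let $Y = aX + b$ (with $a > 0$). Then the
CDF and PDF of $Y$ are respectively

$$F_Y(y) = F_X\left( \frac{y - b}{a} \right) = 1 - e^{-\frac{\lambda}{a}(y - b)}, \qquad f_Y(y) = \frac{1}{a} f_X\left( \frac{y - b}{a} \right) = \frac{\lambda}{a} e^{-\frac{\lambda}{a}(y - b)}, \qquad y \ge b.$$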

Example 4.27. Let $X$ be a random variable with PDF $f_X(x)$ and CDF $F_X(x)$.
Supposing that $Y = X^2$, find $f_Y(y)$ and $F_Y(y)$. Express the answers in terms of $f_X(x)$
and $F_X(x)$.


Solution. We note that 


Therefore, the PDF is 


fy(y) = Gs 


d 


= 5 (FeV) - Fe(-vi)) 
= Fe(VO-EVI- Pe(-VO= (vi) 


Figure 4.34: When transforming a random variable $X$ to $Y = X^2$, the CDF becomes
$F_Y(y) = \frac{\sqrt{y} - a}{b - a}$ and the PDF becomes $f_Y(y) = \frac{1}{2\sqrt{y}(b-a)}$.


Follow-Up. (Square of a uniform random variable) Suppose $X$ is a uniform random variable
in $[a,b]$ (assume $a > 0$), and let $Y = X^2$. Then the CDF and PDF of $Y$ are respectively

$$F_Y(y) = \frac{\sqrt{y} - a}{b - a}, \qquad a^2 \le y \le b^2,$$

$$f_Y(y) = \frac{1}{2\sqrt{y}(b - a)}, \qquad a^2 \le y \le b^2.$$
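The CDF and PDF of the squared uniform random variable can be checked against simulated samples. This is a minimal sketch, assuming the illustrative endpoints a = 1 and b = 3 (chosen here, not taken from the text); it compares the empirical CDF and a normalized histogram of Y = X^2 with the two expressions above.

```python
import numpy as np

a, b = 1.0, 3.0                       # illustrative endpoints with a > 0 (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(a, b, size=200_000)   # X ~ Uniform(a, b)
Y = X ** 2                            # transformed samples

# Compare the empirical CDF with F_Y(y) = (sqrt(y) - a) / (b - a).
y = np.linspace(a**2, b**2, 9)
F_emp = np.array([np.mean(Y <= t) for t in y])
F_theory = (np.sqrt(y) - a) / (b - a)
cdf_err = np.max(np.abs(F_emp - F_theory))

# Compare a normalized histogram with f_Y(y) = 1 / (2 sqrt(y) (b - a)).
hist, edges = np.histogram(Y, bins=40, range=(a**2, b**2), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
f_theory = 1.0 / (2.0 * np.sqrt(centers) * (b - a))
pdf_err = np.max(np.abs(hist - f_theory))

print("max CDF error:", cdf_err)
print("max PDF error:", pdf_err)
```

Both errors should shrink as the number of samples grows, which is the squeeze-and-stretch picture of Figure 4.34 seen in numbers.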


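Example 4.28. Let $X \sim \text{Uniform}(0, 2\pi)$. Suppose $Y = \cos X$. Find $f_Y(y)$ and $F_Y(y)$.
Solution. First, we need to find the CDF of $X$. This can be done by noting that

$$F_X(x) = \int_{-\infty}^{x} f_X(x')\, dx' = \int_{0}^{x} \frac{1}{2\pi}\, dx' = \frac{x}{2\pi}, \qquad 0 \le x \le 2\pi.$$

Thus, for $-1 \le y \le 1$, the CDF of $Y$ is

$$F_Y(y) = \mathbb{P}[Y \le y] = \mathbb{P}[\cos X \le y] = \mathbb{P}[\cos^{-1} y \le X \le 2\pi - \cos^{-1} y] = F_X(2\pi - \cos^{-1} y) - F_X(\cos^{-1} y) = 1 - \frac{\cos^{-1} y}{\pi}.$$

The PDF of $Y$ is

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy}\left(1 - \frac{\cos^{-1} y}{\pi}\right) = \frac{1}{\pi\sqrt{1 - y^2}},$$

where we used the fact that $\frac{d}{dy} \cos^{-1} y = \frac{-1}{\sqrt{1-y^2}}$.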
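The cosine example is a good one to verify on a computer, because the set {x | cos x ≤ y} is easy to get wrong. The following sketch (sample size, seed, and test points are arbitrary choices made here) compares the empirical CDF of cos X with the derived CDF, and checks the derived PDF against a numerical derivative of that CDF.

```python
import numpy as np

def F_Y(y):
    """Derived CDF of Y = cos(X), X ~ Uniform(0, 2*pi), for -1 <= y <= 1."""
    return 1.0 - np.arccos(y) / np.pi

def f_Y(y):
    """Derived PDF of Y = cos(X), for -1 < y < 1."""
    return 1.0 / (np.pi * np.sqrt(1.0 - y ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 2.0 * np.pi, size=200_000)
Y = np.cos(X)

y = np.linspace(-0.9, 0.9, 7)
F_emp = np.array([np.mean(Y <= t) for t in y])
cdf_err = np.max(np.abs(F_emp - F_Y(y)))

# Central finite difference of the CDF should reproduce the PDF.
h = 1e-5
f_fd = (F_Y(y + h) - F_Y(y - h)) / (2.0 * h)

print("max CDF error:", cdf_err)
print("max |finite difference - PDF|:", np.max(np.abs(f_fd - f_Y(y))))
```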


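Example 4.29. Let $X$ be a random variable with PDF

$$f_X(x) = a e^{x} e^{-a e^{x}}.$$

Let $Y = e^{X}$, and find $f_Y(y)$.
Solution. We first note that, for $y > 0$,

$$F_Y(y) = \mathbb{P}[Y \le y] = \mathbb{P}[e^{X} \le y] = \mathbb{P}[X \le \log y] = \int_{-\infty}^{\log y} a e^{x} e^{-a e^{x}}\, dx.$$

To find the PDF, we recall the fundamental theorem of calculus. This gives us

$$f_Y(y) = \frac{d}{dy} \int_{-\infty}^{\log y} a e^{x} e^{-a e^{x}}\, dx = \left(\frac{d \log y}{dy}\right) a e^{\log y} e^{-a e^{\log y}} = \frac{1}{y}\, a y\, e^{-a y} = a e^{-a y}.$$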


Closing remark. The transformation of random variables is a fundamental technique in 
data science. The approach we have presented is the most rudimentary yet the most intuitive. 
The key is to visualize the transformation and how the random samples are allocated after 
the transformation. Note that the density of the random samples is related to the slope of 
the CDF. Therefore, if the transformation maps many samples to similar values, the slope 
of the CDF will be steep. Once you understand this picture, the transformation will be a 
lot easier to understand. 

Is it possible to replace the paper-and-pencil derivation of a transformation with a 
computer? If the objective is to transform random realizations, then the answer is yes 
because your goal is to transform numbers to numbers, which can be done on a computer. 
For example, transforming a sample $x_1$ to $\sqrt{x_1}$ is straightforward on a computer. However,
if the objective is to derive the theoretical expression of the PDF, then the answer is no. 
Why might we want to derive the theoretical PDF? We might want to analyze the mean, 
variance, or other statistical properties. We may also want to reverse-engineer and determine 
a transformation that can yield a specific PDF. This would require a paper-and-pencil 
derivation. In what follows, we will discuss a handy application of the transformations. 


What are the rules of thumb for transformation of random variables? 


• Always find the CDF $F_Y(y) = \mathbb{P}[g(X) \le y]$. Ask yourself: What are the values
of $X$ such that $g(X) \le y$? Think of the cosine example.

• Sometimes you do not need to solve for $F_Y(y)$ explicitly. The fundamental theorem
of calculus can help you find $f_Y(y)$.

• Draw pictures. Ask yourself whether you need to squeeze or stretch the samples.


4.8 Generating Random Numbers 


Most scientific computing software nowadays has built-in random number generators. For 
common types of random variables, e.g., Gaussian or exponential, these random number 
generators can easily generate numbers according to the chosen distribution. However, if we 
are given an arbitrary PDF (or PMF) that is not among the list of predefined distributions, 
how can we generate random numbers according to the PDF or PMF we want?


4.8.1 General principle


Generating random numbers according to the desired distribution can be formulated as
an inverse problem. Suppose that we can generate uniformly random numbers according
to Uniform(0,1). This is a mild assumption, since this process can be done on almost all
computers today. Let us call this random variable $U$ and its realization $u$. Suppose that we
also have a desired distribution $f_X(x)$ (and its CDF $F_X(x)$). We can put the two random
variables $U$ and $X$ on the two axes of Figure 4.35, yielding an input-output relationship.
The inverse problem is: By using what transformation $g$, such that $X = g(U)$, can we make
sure that $X$ is distributed according to $f_X(x)$ (or $F_X(x)$)?


[Axes: $x$, distributed according to $F_X(x)$; $u$, distributed according to $F_U(u)$]

Figure 4.35: Generating random numbers according to a known CDF. The idea is to first generate a
Uniform(0,1) random variable, then do an inverse mapping $F_X^{-1}$.


Theorem 4.12. The transformation $g$ that can turn a uniform random variable into
a random variable following a distribution $F_X(x)$ is given by

$$g(u) = F_X^{-1}(u). \tag{4.35}$$

That is, if $g = F_X^{-1}$, then $g(U)$ will be distributed according to $f_X$ (or $F_X$).


Proof. First, we know that if $U \sim \text{Uniform}(0,1)$, then $f_U(u) = 1$ for $0 \le u \le 1$, so

$$F_U(u) = \int_{-\infty}^{u} f_U(u')\, du' = u,$$

for $0 \le u \le 1$. Let $g = F_X^{-1}$ and define $Y = g(U)$. Then the CDF of $Y$ is

$$F_Y(y) = \mathbb{P}[Y \le y] = \mathbb{P}[g(U) \le y] = \mathbb{P}[F_X^{-1}(U) \le y] = \mathbb{P}[U \le F_X(y)] = F_U(F_X(y)) = F_X(y).$$

Therefore, we have shown that the CDF of $Y$ is the CDF of $X$.

The theorem above states that if we want a distribution $F_X$, then the transformation
should be $g = F_X^{-1}$. This suggests a two-step process for generating random numbers.


How do we generate random numbers from an arbitrary distribution $F_X$?

• Step 1: Generate a random number $U \sim \text{Uniform}(0, 1)$.

• Step 2: Let

$$Y = F_X^{-1}(U).$$

Then the distribution of $Y$ is $F_X$.
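The two steps translate almost line by line into code. The sketch below is a hedged illustration rather than a library routine: the function name inverse_transform_sample is hypothetical, and the target distribution f_X(x) = 2x on [0, 1] (so F_X(x) = x^2 and F_X^{-1}(u) = sqrt(u)) is an assumption chosen because it is not a predefined distribution yet has a closed-form inverse CDF.

```python
import numpy as np

def inverse_transform_sample(inv_cdf, size, rng=None):
    """Draw `size` samples by applying an inverse CDF to Uniform(0,1) draws."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(0.0, 1.0, size=size)   # Step 1: U ~ Uniform(0, 1)
    return inv_cdf(u)                      # Step 2: Y = F_X^{-1}(U)

# Illustrative target (an assumption for this sketch): f_X(x) = 2x on [0, 1],
# so F_X(x) = x**2 and F_X^{-1}(u) = sqrt(u).
samples = inverse_transform_sample(np.sqrt, 100_000, rng=np.random.default_rng(0))

# The empirical CDF of the samples should follow F_X(x) = x**2.
x_grid = np.linspace(0.1, 0.9, 9)
F_empirical = np.array([np.mean(samples <= t) for t in x_grid])
max_cdf_error = np.max(np.abs(F_empirical - x_grid ** 2))

print("sample mean:", samples.mean(), "(theory: 2/3)")
print("max CDF error:", max_cdf_error)
```

Passing the inverse CDF as a function keeps the sampler generic: any distribution whose $F_X^{-1}$ can be evaluated, analytically or numerically, can be plugged in.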


4.8.2 Examples 


Example 4.30. How can we generate Gaussian random numbers with mean $\mu$ and
variance $\sigma^2$ from uniform random numbers?


First, we generate $U \sim \text{Uniform}(0, 1)$. The CDF of the ideal distribution is

$$F_X(x) = \Phi\left(\frac{x - \mu}{\sigma}\right).$$

Therefore, the transformation $g$ is

$$g(U) = F_X^{-1}(U) = \sigma \Phi^{-1}(U) + \mu.$$

In Figure 4.36, we plot the CDF $F_X$ and the transformation $g$.


(a) $F_X(\cdot)$    (b) $g(\cdot)$

Figure 4.36: To generate random numbers according to Gaussian(0, 1), we plot its CDF in (a)
and the transformation $g$ in (b).


To visualize the random variables before and after the transformation, we plot
the histograms in Figure 4.37.

(a) PDF of $U$    (b) PDF of $g(U)$

Figure 4.37: (a) PDF of the uniform random variable. (b) The PDF of the transformed random
variable.


The MATLAB and Python codes used to generate the histograms above are shown
below.


% MATLAB code to generate Gaussian from uniform
mu = 3;                              % mean
sigma = 2;                           % standard deviation (illustrative value)
U = rand(10000,1);                   % uniform(0,1) samples
gU = sigma*icdf('norm',U,0,1)+mu;    % inverse-CDF transformation
figure; hist(U);
figure; hist(gU);


# Python code to generate Gaussian from uniform
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

mu = 3                                 # mean
sigma = 2                              # standard deviation (illustrative value)
U = stats.uniform.rvs(0,1,size=10000)  # uniform(0,1) samples
gU = sigma*stats.norm.ppf(U)+mu        # inverse-CDF transformation
plt.hist(U); plt.show()
plt.hist(gU); plt.show()


Example 4.31. How can we generate exponential random numbers with parameter
$\lambda$ from uniform random numbers?


First, we generate $U \sim \text{Uniform}(0,1)$. The CDF of the ideal distribution is

$$F_X(x) = 1 - e^{-\lambda x}.$$

Therefore, the transformation $g$ is

$$g(U) = F_X^{-1}(U) = -\frac{1}{\lambda} \log(1 - U).$$


The CDF of the exponential random variable and the transformation g are shown 
in Figure 4.38. 


(a) $F_X(\cdot)$    (b) $g(\cdot)$

Figure 4.38: To generate random numbers according to Exponential(1), we plot its CDF in (a)
and the transformation $g$ in (b).


The PDF of the uniform random variable U and the PDF of the transformed 
variable g(U) are shown in Figure 4.39. 


(a) PDF of $U$    (b) PDF of $g(U)$

Figure 4.39: (a) PDF of the uniform random variable. (b) The PDF of the transformed random
variable.


The MATLAB and Python codes for this transformation are shown below. 


% MATLAB code to generate exponential random variables
lambda = 1;                   % rate parameter
U = rand(10000,1);            % uniform(0,1) samples
gU = -(1/lambda)*log(1-U);    % inverse-CDF transformation


# Python code to generate exponential random variables
import numpy as np
import scipy.stats as stats

lambd = 1                              # rate parameter
U = stats.uniform.rvs(0,1,size=10000)  # uniform(0,1) samples
gU = -(1/lambd)*np.log(1-U)            # inverse-CDF transformation


Example 4.32. How can we generate the 4 integers 1,2,3,4, according to the his- 
togram [0.1 0.5 0.3 0.1], from uniform random numbers? 


First, we generate $U \sim \text{Uniform}(0, 1)$. The CDF of the ideal distribution is

$$F_X(x) = \begin{cases} 0.1, & x = 1, \\ 0.1 + 0.5 = 0.6, & x = 2, \\ 0.1 + 0.5 + 0.3 = 0.9, & x = 3, \\ 0.1 + 0.5 + 0.3 + 0.1 = 1.0, & x = 4. \end{cases}$$

This CDF is not invertible. However, we can still define the "inverse" mapping

$$g(U) = \begin{cases} 1, & 0.0 \le U \le 0.1, \\ 2, & 0.1 < U \le 0.6, \\ 3, & 0.6 < U \le 0.9, \\ 4, & 0.9 < U \le 1.0. \end{cases}$$

For example, if $0.1 < U \le 0.6$, then on the black curve shown in Figure 4.40(a), we
are looking at the second vertical line from the left. This will go to "2" on the x-axis.
Therefore, the inversely mapped value is 2 for $0.1 < U \le 0.6$.

(a) $F_X(\cdot)$    (b) $g(\cdot)$

Figure 4.40: To generate random numbers according to a predefined histogram, we first define
the CDF in (a) and the corresponding transformation in (b).


The PDFs of the transformed variables, before and after, are shown in Figure 4.41.

(a) PDF of $U$    (b) PDF of $g(U)$

Figure 4.41: (a) PDF of the uniform random variable. (b) The PDF of the transformed random
variable.


In MATLAB, the above PDFs can be plotted using the commands below. In Python,
we need to use the logical comparison np.logical_and to identify the indices. An alternative
is to use gU[((U<=0.1)*(U>=0.0)).astype(bool)]=1.


% MATLAB code to generate the desired random variables
U = rand(10000,1);
gU = zeros(10000,1);
gU((U>=0) & (U<=0.1)) = 1;
gU((U>0.1) & (U<=0.6)) = 2;
gU((U>0.6) & (U<=0.9)) = 3;
gU((U>0.9) & (U<=1)) = 4;


# Python code to generate the desired random variables
import numpy as np
import scipy.stats as stats

U = stats.uniform.rvs(0,1,size=10000)
gU = np.zeros(10000)
gU[np.logical_and(U >= 0.0, U <= 0.1)] = 1
gU[np.logical_and(U > 0.1, U <= 0.6)] = 2
gU[np.logical_and(U > 0.6, U <= 0.9)] = 3
gU[np.logical_and(U > 0.9, U <= 1)] = 4
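When the histogram has many bins, writing one logical comparison per interval becomes tedious. One possible alternative, sketched below under the assumption that NumPy's np.searchsorted is acceptable for the task, is to look up each uniform sample in the cumulative sum of the PMF; the variable names values, pmf, and cdf are chosen here for illustration.

```python
import numpy as np

values = np.array([1, 2, 3, 4])           # the four integers to generate
pmf = np.array([0.1, 0.5, 0.3, 0.1])      # desired histogram
cdf = np.cumsum(pmf)                      # approximately [0.1, 0.6, 0.9, 1.0]
cdf[-1] = 1.0                             # guard against floating-point rounding

rng = np.random.default_rng(0)
U = rng.uniform(0.0, 1.0, size=100_000)

# side="left" returns the first index i with U <= cdf[i], which matches the
# intervals 0 <= U <= 0.1, 0.1 < U <= 0.6, 0.6 < U <= 0.9, 0.9 < U <= 1.
idx = np.searchsorted(cdf, U, side="left")
gU = values[idx]

freq = np.array([np.mean(gU == v) for v in values])
print("empirical frequencies:", freq)
```

The empirical frequencies should be close to the target histogram, and the same code works unchanged for a PMF with any number of states.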


4.9 Summary 


Let us summarize this chapter by revisiting the four bullet points from the beginning of the 
chapter. 


• Definition of a continuous random variable. Continuous random variables are measured
by lengths, areas, and volumes, which are all defined by integrations. This makes
them different from discrete random variables, which are measured by counts (and
summations). Because of the different measures being used to define random variables,
we consequently have different ways of defining expectation, variance, moments, etc.,


4.11 Problems


Exercise 1. (VIDEO SOLUTION) 
Let $X$ be a Gaussian random variable with $\mu = 5$ and $\sigma^2 = 16$.

(a) Find $\mathbb{P}[X > 4]$ and $\mathbb{P}[2 \le X \le 7]$.
(b) If $\mathbb{P}[X < a] = 0.8869$, find $a$.
(c) If $\mathbb{P}[X > b] = 0.1131$, find $b$.
(d) If $\mathbb{P}[13 < X \le c] = 0.0011$, find $c$.


Exercise 2. (VIDEO SOLUTION) 
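Compute $\mathbb{E}[Y]$ and $\mathbb{E}[Y^2]$ for the following random variables:

(a) $Y = A\cos(\omega t + \theta)$, where $A \sim \mathcal{N}(\mu, \sigma^2)$.
(b) $Y = a\cos(\omega t + \Theta)$, where $\Theta \sim \text{Uniform}(0, 2\pi)$.
(c) $Y = a\cos(\omega T + \theta)$, where $T \sim \text{Uniform}\left(-\frac{\pi}{\omega}, \frac{\pi}{\omega}\right)$.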


Exercise 3. (VIDEO SOLUTION) 
Consider a CDF 


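$$F_X(x) = \begin{cases} 0, & \text{if } x < -1, \\ 0.5, & \text{if } -1 \le x < 0, \\ (1 + x)/2, & \text{if } 0 \le x < 1, \\ 1, & \text{otherwise}. \end{cases}$$

(a) Find $\mathbb{P}[X < -1]$, $\mathbb{P}[-0.5 < X < 0.5]$ and $\mathbb{P}[X > 0.5]$.
(b) Find $f_X(x)$.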


Exercise 4. (VIDEO SOLUTION) 
A random variable X has CDF: 


0 if «<0 
F = o] y] 
x(@) ees if x>0. 


(a) Find $\mathbb{P}[X \le 2]$, $\mathbb{P}[X = 0]$, $\mathbb{P}[X < 0]$, $\mathbb{P}[2 < X < 6]$ and $\mathbb{P}[X > 10]$.
(b) Find $f_X(x)$.


Exercise 5. (VIDEO SOLUTION)


A random variable X has PDF 


$$f_X(x) = \begin{cases} c x (1 - x^2), & 0 \le x \le 1, \\ 0, & \text{otherwise}. \end{cases}$$

Find $c$, $F_X(x)$, and $\mathbb{E}[X]$.


Exercise 6. (VIDEO SOLUTION) 
A continuous random variable X has a cumulative distribution 


$$F_X(x) = \begin{cases} 0, & x < 0, \\ 0.5 + c \sin^2(\pi x / 2), & 0 \le x \le 1, \\ 1, & x > 1. \end{cases}$$

(a) What values can $c$ assume?

(b) Find $f_X(x)$.


Exercise 7. (VIDEO SOLUTION)
A continuous random variable X is uniformly distributed in [—2, 2]. 


(a) Let $Y = \sin(\pi X / 8)$. Find $f_Y(y)$.
(b) Let $Z = -2X^2 + 3$. Find $f_Z(z)$.


Hint: Compute $F_Y(y)$ from $F_X(x)$, and use $\frac{d}{dy} \sin^{-1} y = \frac{1}{\sqrt{1 - y^2}}$.


Exercise 8. 
Let $Y = e^{X}$.


(a) Find the CDF and PDF of Y in terms of the CDF and PDF of X. 


(b) Find the PDF of Y when X is a Gaussian random variable. In this case, Y is said to 
be a lognormal random variable. 


Exercise 9. 
The random variable X has the PDF 


i 
ss O0<a<il, 
Pele) = ; 
0, otherwise. 
Let Y be a new random variable 
0, X <0, 
Yeatw Osx <1 

1, X >I. 


Find $F_Y(y)$ and $f_Y(y)$, for $-\infty < y < \infty$.


Exercise 10. 
A random variable X has the PDF 


$$f_X(x) = \begin{cases} 2x e^{-x^2}, & x \ge 0, \\ 0, & x < 0. \end{cases}$$

Let

$$Y = g(X) = \begin{cases} 1 - e^{-X^2}, & X \ge 0, \\ 0, & X < 0. \end{cases}$$

Find the PDF of $Y$.


Exercise 11. 
A random variable X has the PDF 


$$f_X(x) = \frac{1}{2} e^{-|x|}, \qquad -\infty < x < \infty.$$

Let $Y = g(X) = e^{-X}$. Find the PDF of $Y$.


Exercise 12. 
A random variable X has the PDF 


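$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}, \qquad -\infty < x < \infty.$$

Find the PDF of $Y$ where

$$Y = g(X) = \begin{cases} X, & |X| > K, \\ -X, & |X| < K. \end{cases}$$


Exercise 13.
A random variable $X$ has the PDF

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}, \qquad -\infty < x < \infty.$$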
Let Y = g(X) = x: Find the PDF of Y. 
Exercise 14. 
A random variable X has the CDF 
0, x <0, 
Fx (x) = 4 2%, Osa 1, 
1; z>1, 


Exercise 15. 

Energy efficiency is an important aspect of designing electrical systems. In some modern
buildings (e.g., airports), traditional escalators are being replaced by a new type of "smart"
escalator which can automatically switch between a normal operating mode and a standby
mode depending on the flow of pedestrians.

(a) The arrival of pedestrians can be modeled as a Poisson random variable. Let $N$ be the
number of arrivals, and let $\lambda$ be the arrival rate (people per minute). For a period of
$t$ minutes, show that the probability that there are $n$ arrivals is

$$\mathbb{P}(N = n) = \frac{(\lambda t)^n}{n!} e^{-\lambda t}.$$

(b) Let $T$ be a random variable denoting the interarrival time (i.e., the time between two
consecutive arrivals). Show that

$$\mathbb{P}(T > t) = e^{-\lambda t}.$$

Also, determine $F_T(t)$ and $f_T(t)$. Sketch $f_T(t)$.
(Hint: Note that $\mathbb{P}(T > t) = \mathbb{P}(\text{no arrival in } t \text{ minutes})$.)

(c) Suppose that the escalator will go into standby mode if there are no pedestrians for
$t_0 = 30$ seconds. Let $Y$ be a random variable denoting the amount of time that the
escalator is in standby mode. That is, let

$$Y = \begin{cases} 0, & \text{if } T \le t_0, \\ T - t_0, & \text{if } T > t_0. \end{cases}$$

Find $\mathbb{E}[Y]$.


Joint Distributions


When you go to a concert hall, sometimes you may want to see a solo violin concert, but other
times you may want to see a symphony. Symphonies are appealing because many instruments 
are playing together. Random variables are similar. While single random variables are useful 
for modeling simple events, we use multiple random variables to describe complex events. 
The multiple random variables can be either independent or correlated. When many random 
variables are present in the problem, we enter the subject of joint distribution. 


What are joint distributions? 


In the simplest sense, joint distributions are extensions of the PDFs and PMFs we studied 
in the previous chapters. We summarize them as follows. 


What do we mean by a high-dimensional PDF? We know that a single random variable is
characterized by a 1-dimensional PDF $f_X(x)$. If we have a pair of random variables, then
we use a 2-dimensional function $f_{X,Y}(x,y)$, and if we have a triplet of random variables,
we use a 3-dimensional function $f_{X,Y,Z}(x,y,z)$. In general, the dimensionality of the PDF
grows as the number of variables:

$$\underbrace{f_X(x)}_{\text{one variable}} \Longrightarrow \underbrace{f_{X_1,X_2}(x_1,x_2)}_{\text{two variables}} \Longrightarrow \cdots \Longrightarrow \underbrace{f_{X_1,\ldots,X_N}(x_1,\ldots,x_N)}_{N \text{ variables}}.$$

For busy engineers like us, $f_{X_1,\ldots,X_N}(x_1,\ldots,x_N)$ is not a friendly notation. A more
concise way to write $f_{X_1,\ldots,X_N}(x_1,\ldots,x_N)$ is to define a vector of random variables
$\boldsymbol{X} = [X_1, X_2, \ldots, X_N]^T$ with a vector of states $\boldsymbol{x} = [x_1, x_2, \ldots, x_N]^T$, and to define the PDF as

$$f_{\boldsymbol{X}}(\boldsymbol{x}) = f_{X_1,\ldots,X_N}(x_1,\ldots,x_N).$$


Under what circumstance will we encounter creatures like $f_{\boldsymbol{X}}(\boldsymbol{x})$? Believe it or not,
these high-dimensional PDFs are everywhere. In 2010, computer-vision scientists created
the ImageNet dataset, containing 14 million images with ground-truth class labels. This
enormous dataset has enabled a great blossoming of machine learning over the past several
decades, in which many advances in deep learning have been made. Fundamentally, the
ImageNet dataset provides a large collection of samples drawn from a latent distribution
that is high-dimensional. Each sample in the ImageNet dataset is a 224 × 224 × 3 image (the
three numbers stand for the image's height, width, and color). If we convert this image into
a vector, then the sample will have a dimension of 224 × 224 × 3 = 150,528. In other words,
the sample is a vector $\boldsymbol{x} \in \mathbb{R}^{150528 \times 1}$. The probability of obtaining a particular sample $\boldsymbol{x}$
is determined by the probability density function $f_{\boldsymbol{X}}(\boldsymbol{x})$. For example, it is more likely to get
an image containing trees than one containing a Ferrari. The manifold generated by $f_{\boldsymbol{X}}(\boldsymbol{x})$
can be extremely complex, as illustrated in Figure 5.1.


Figure 5.1: Joint distributions are ubiquitous in modern data analysis. For example, an image from a
dataset can be represented by a high-dimensional vector $\boldsymbol{x}$. Each vector has a certain probability of
being present. This probability is described by the high-dimensional joint PDF $f_{\boldsymbol{X}}(\boldsymbol{x})$. The goal of this
chapter is to understand the properties of this $f_{\boldsymbol{X}}$.


Figure 5.2: A 2-dimensional PDF $f_{X,Y}(x,y)$ of a pair of random variables $(X,Y)$ and their respective
1D PDFs $f_X(x)$ and $f_Y(y)$.

The story of ImageNet is just one of the many instances for which we use a joint 
distribution $f_{\boldsymbol{X}}(\boldsymbol{x})$. Joint distributions are ubiquitous. If you do data science, you must
understand joint distributions. However, extending a 1-dimensional function $f_X(x)$ to a
2-dimensional function $f_{X,Y}(x,y)$ and then to an N-dimensional function $f_{\boldsymbol{X}}(\boldsymbol{x})$ is not trivial.
The goal of this chapter is to guide you through these important steps. 


Plan of Part 1 of this chapter: Two variables 


This chapter is broadly divided into two halves. In the first half, we will look at a pair of 
random variables. 


• Definition of $f_{X,Y}(x,y)$. The first thing we need to learn is the definition of a joint
distribution with two variables. Since we have two variables, the joint probability
density function (or probability mass function) is a 2-dimensional function. A point
on this 2D function is the probability density evaluated by a pair of variables $X = x$
and $Y = y$, as illustrated in Figure 5.2. However, how do we formally define this 2D
function? How is it related to the probability measure? Is there a way we can retrieve
$f_X(x)$ and $f_Y(y)$ from $f_{X,Y}(x,y)$, as illustrated on the right-hand sides of Figure 5.2?
These questions will be answered in Section 5.1.

• Joint expectation $\mathbb{E}[XY]$. When we have a pair of random variables, how should we
define the expectation? In Section 5.2, we will show that the most natural way to define
the joint expectation is in terms of $\mathbb{E}[XY]$, i.e., the expectation of the product. There
is a surprising and beautiful connection between this "expectation of the product" and
the cosine angle between two vectors, thereby showing that $\mathbb{E}[XY]$ is the correlation
between $X$ and $Y$.

• The reason for studying a pair of random variables is to spell out the cause-effect
relationship between the variables. This cannot be done without conditional distributions;
this will be explained in Section 5.3. Conditional distributions provide an
extremely important computational tool for decoupling complex events into simpler
events. Such decomposition allows us to solve difficult joint expectation problems via
simple conditional expectations; this subject will be covered in Section 5.4.

• If you recall our discussions about the origin of a Gaussian random variable, we claimed
that the PDF of $X + Y$ is the convolution between $f_X$ and $f_Y$. Why is this so? We
will answer this question in terms of joint distributions in Section 5.5.


Plan of Part 2 of this chapter: N variables 


The second half of the chapter focuses on the general case of $N$ random variables. This
requires the definitions of a random vector $\boldsymbol{X} = [X_1, \ldots, X_N]^T$, a joint distribution $f_{\boldsymbol{X}}(\boldsymbol{x})$,
and the corresponding expectations $\mathbb{E}[\boldsymbol{X}]$. To make our discussions concrete, we will focus
on the case of high-dimensional Gaussian random variables and discuss the following topics.


• Covariance matrices/correlation matrices. If a pair of random variables can define
the correlation through the expectation of the product $\mathbb{E}[X_1 X_2]$, then for a vector of
random variables we can consider a matrix of correlations in the form

$$\boldsymbol{R} = \begin{bmatrix} \mathbb{E}[X_1 X_1] & \mathbb{E}[X_1 X_2] & \cdots & \mathbb{E}[X_1 X_N] \\ \mathbb{E}[X_2 X_1] & \mathbb{E}[X_2 X_2] & \cdots & \mathbb{E}[X_2 X_N] \\ \vdots & \vdots & \ddots & \vdots \\ \mathbb{E}[X_N X_1] & \mathbb{E}[X_N X_2] & \cdots & \mathbb{E}[X_N X_N] \end{bmatrix}.$$

What are the properties of the matrix? How does it affect the shape of the high-dimensional
Gaussian? If we have a dataset of vectors, how do we estimate this matrix
from the data? We will answer these questions in Section 5.6 and Section 5.7.

• Principal-component analysis. Given the covariance matrix, we can perform some
very useful data analyses, such as the principal-component analysis in Section 5.8.
The question we will ask is: Among the many components, which one is the principal
component? If we can find the principal component(s), we can effectively perform
dimensionality reduction by projecting a high-dimensional vector into low-dimensional
representations. We will introduce an application for face detection.


Figure 5.3: When there is a pair of random variables, we can regard the sample space as a set of
coordinates. The random variables are 2D mappings from a coordinate $\boldsymbol{\omega}$ in $\Omega_X \times \Omega_Y$ to another
coordinate $\boldsymbol{X}(\boldsymbol{\omega})$ in $\mathbb{R}^2$.


5.1 Joint PMF and Joint PDF 


Probability is a measure of the size of a set. This principle applies to discrete random vari- 
ables, continuous random variables, single random variables, and multiple random variables. 
In situations with a pair of random variables, the measure should be applied to the coordi- 
nate (X,Y) represented by the random variables X and Y. Consequently, when measuring 
the probability, we either count these coordinates or integrate the area covered by these 
coordinates. In this section, we formalize this notion of measuring 2D events. 


5.1.1 Probability measure in 2D 


Consider two random variables $X$ and $Y$. Let the sample space of $X$ and $Y$ be $\Omega_X$ and
$\Omega_Y$, respectively. Define the Cartesian product of $\Omega_X$ and $\Omega_Y$ as $\Omega_X \times \Omega_Y = \{(x,y) \mid x \in \Omega_X \text{ and } y \in \Omega_Y\}$.
That is, $\Omega_X \times \Omega_Y$ contains all possible pairs $(X,Y)$.


Example 5.1. If $\Omega_X = \{1,2\}$ and $\Omega_Y = \{4,5\}$, then $\Omega_X \times \Omega_Y = \{(1,4), (1,5), (2,4), (2,5)\}$.


Example 5.2. If $\Omega_X = [3,4]$ and $\Omega_Y = [1,2]$, then $\Omega_X \times \Omega_Y$ = a rectangle with two
diagonal vertices as (3,1) and (4,2).


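Random variables are mappings from the sample space to the real line. If $\omega \in \Omega_X$ is
mapped to $X(\omega) \in \mathbb{R}$, and $\xi \in \Omega_Y$ is mapped to $Y(\xi) \in \mathbb{R}$, then a coordinate $\boldsymbol{\omega} = (\omega, \xi)$ in
the sample space $\Omega_X \times \Omega_Y$ should be mapped to a coordinate $(X(\omega), Y(\xi))$ in the 2D plane:

$$\boldsymbol{\omega} \overset{\text{def}}{=} \begin{bmatrix} \omega \\ \xi \end{bmatrix} \longmapsto \begin{bmatrix} X(\omega) \\ Y(\xi) \end{bmatrix} \overset{\text{def}}{=} \boldsymbol{X}(\boldsymbol{\omega}).$$

We denote such a vector-to-vector mapping as $\boldsymbol{X}(\cdot) : \Omega_X \times \Omega_Y \to \mathbb{R} \times \mathbb{R}$, as illustrated in
Figure 5.3.

Therefore, if we have an event $A \subseteq \mathbb{R}^2$, the probability that $A$ happens is

$$\mathbb{P}[A] = \mathbb{P}[\{\boldsymbol{\omega} \mid \boldsymbol{X}(\boldsymbol{\omega}) \in A\}] = \mathbb{P}\left[\left\{ \begin{bmatrix} \omega \\ \xi \end{bmatrix} \,\middle|\, \begin{bmatrix} X(\omega) \\ Y(\xi) \end{bmatrix} \in A \right\}\right] = \mathbb{P}[\boldsymbol{\omega} \in \boldsymbol{X}^{-1}(A)].$$

In other words, we take the coordinate $\boldsymbol{X}(\boldsymbol{\omega})$ and find its inverse image $\boldsymbol{X}^{-1}(A)$. The size
of this inverse image $\boldsymbol{X}^{-1}(A)$ in the sample space $\Omega_X \times \Omega_Y$ is then the probability. We
summarize this general principle as follows.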


How to measure probability in 2D 


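For a pair of random variables $\boldsymbol{X} = (X,Y)$, the probability of an event $A$ is measured
in the product space $\Omega_X \times \Omega_Y$ with the size

$$\mathbb{P}[\{\boldsymbol{\omega} \mid \boldsymbol{\omega} \in \boldsymbol{X}^{-1}(A)\}].$$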


This definition is quite abstract. To make it more concrete, we will look at discrete and 
continuous random variables. 


5.1.2 Discrete random variables 
Suppose that the random variables $X$ and $Y$ are discrete. Let $A = \{X(\omega) = x, Y(\xi) = y\}$
be a discrete event. Then the above definition tells us that the probability of $A$ is

$$\mathbb{P}[A] = \mathbb{P}[\{(\omega, \xi) \mid X(\omega) = x, \text{ and } Y(\xi) = y\}] = \underbrace{\mathbb{P}[X = x \text{ and } Y = y]}_{\overset{\text{def}}{=}\; p_{X,Y}(x,y)}.$$

We define this probability as the joint probability mass function (joint PMF) $p_{X,Y}(x,y)$.


Definition 5.1. Let $X$ and $Y$ be two discrete random variables. The joint PMF of $X$
and $Y$ is defined as

$$p_{X,Y}(x,y) = \mathbb{P}[X = x \text{ and } Y = y] = \mathbb{P}[\{(\omega, \xi) \mid X(\omega) = x, \text{ and } Y(\xi) = y\}]. \tag{5.1}$$

We sometimes write the joint PMF as $p_{X,Y}(x,y) = \mathbb{P}[X = x, Y = y]$.


Figure 5.4: A joint PMF for a pair of discrete random variables consists of an array of impulses. To
measure the size of the event $A$, we sum all the impulses inside $A$.


Figure 5.4 shows a graphical portrayal of the joint PMF. In a nutshell, $p_{X,Y}(x,y)$
can be considered as a 2D extension of a single variable PMF. The probabilities are still
represented by the impulses, but the domain of these impulses is now a 2D plane. If we have
an event $A$, then the size of the event is

$$\mathbb{P}[A] = \sum_{(x,y) \in A} p_{X,Y}(x,y).$$


Example 5.3. Let $X$ be a coin flip, $Y$ be a die. The sample space of $X$ is {0,1},
whereas the sample space of $Y$ is {1,2,3,4,5,6}. The joint PMF, according to our
definition, is the probability $\mathbb{P}[X = x \text{ and } Y = y]$, where $x$ takes a binary state and $y$
takes one of the 6 states. The following table summarizes all the 12 states of the joint
distribution.

          Y = 1   Y = 2   Y = 3   Y = 4   Y = 5   Y = 6
  X = 0   1/12    1/12    1/12    1/12    1/12    1/12
  X = 1   1/12    1/12    1/12    1/12    1/12    1/12

In this table, since there are 12 coordinates, and each coordinate has an equal
chance of appearing, the probability for each coordinate becomes 1/12. Therefore, the
joint PMF of $X$ and $Y$ is

$$p_{X,Y}(x,y) = \frac{1}{12}, \qquad x = 0, 1, \quad y = 1, 2, 3, 4, 5, 6.$$

In this example, we observe that if $X$ and $Y$ are not interacting with each other (formally,
independent), the joint PMF is the product of the two individual probabilities.


Example 5.4. In the previous example, if we define $A = \{X + Y = 3\}$, the probability
$\mathbb{P}[A]$ is

$$\mathbb{P}[A] = \sum_{(x,y) \in A} p_{X,Y}(x,y) = p_{X,Y}(0,3) + p_{X,Y}(1,2) = \frac{2}{12}.$$

If $B = \{\min(X,Y) = 1\}$, the probability $\mathbb{P}[B]$ is

$$\mathbb{P}[B] = \sum_{(x,y) \in B} p_{X,Y}(x,y) = p_{X,Y}(1,1) + p_{X,Y}(1,2) + p_{X,Y}(1,3) + p_{X,Y}(1,4) + p_{X,Y}(1,5) + p_{X,Y}(1,6) = \frac{6}{12}.$$

5.1.3 Continuous random variables


The continuous version of the joint PMF is called the joint probability density function
(joint PDF), denoted by $f_{X,Y}(x,y)$. A joint PDF is analogous to a joint PMF. For example,
integrating it will give us the probability.


Definition 5.2. Let $X$ and $Y$ be two continuous random variables. The joint PDF of
$X$ and $Y$ is a function $f_{X,Y}(x,y)$ that can be integrated to yield a probability

$$\mathbb{P}[A] = \int_{A} f_{X,Y}(x,y)\, dx\, dy, \tag{5.2}$$

for any event $A \subseteq \Omega_X \times \Omega_Y$.


Pictorially, we can view $f_{X,Y}$ as a 2D function where the height at a coordinate $(x,y)$ is
$f_{X,Y}(x,y)$, as can be seen from Figure 5.5. To compute the probability that $(X,Y) \in A$,
we integrate the function $f_{X,Y}$ with respect to the area covered by the set $A$. For example,
if the set $A$ is a rectangular box $A = [a,b] \times [c,d]$, then the integration becomes

$$\mathbb{P}[A] = \mathbb{P}[a \le X \le b, \; c \le Y \le d] = \int_{c}^{d} \int_{a}^{b} f_{X,Y}(x,y)\, dx\, dy.$$


Figure 5.5: A joint PDF for a pair of continuous random variables is a surface in the 2D plane. To
measure the size of the event $A$, we integrate $f_{X,Y}(x,y)$ inside $A$.


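Example 5.5. Consider a uniform joint PDF $f_{X,Y}(x,y)$ defined on $[0,2]^2$ with $f_{X,Y}(x,y) = \frac{1}{4}$.
Let $A = [a,b] \times [c,d] \subseteq [0,2]^2$. Find $\mathbb{P}[A]$.


Solution.

$$\mathbb{P}[A] = \mathbb{P}[a \le X \le b, \; c \le Y \le d] = \int_{c}^{d} \int_{a}^{b} f_{X,Y}(x,y)\, dx\, dy = \int_{c}^{d} \int_{a}^{b} \frac{1}{4}\, dx\, dy = \frac{(b-a)(d-c)}{4}.$$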

Practice Exercise 5.1. In the previous example, let $B = \{X + Y \le 2\}$. Find $\mathbb{P}[B]$.


Solution.

$$\mathbb{P}[B] = \int_{B} f_{X,Y}(x,y)\, dx\, dy = \int_{0}^{2} \int_{0}^{2-y} f_{X,Y}(x,y)\, dx\, dy = \int_{0}^{2} \int_{0}^{2-y} \frac{1}{4}\, dx\, dy = \int_{0}^{2} \frac{2-y}{4}\, dy = \frac{1}{2}.$$

Here, the limits of the integration can be determined from Figure 5.6. The inner
integration (with respect to $x$) should start from 0 and end at $2 - y$, which is the line
defining the set $x + y \le 2$. Since the inner integration is performed for every $y$, we
need to enumerate all the possible $y$'s to complete the outer integration. This leads to
the outer limit from 0 to 2.
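The choice of integration limits can be double-checked numerically. The sketch below uses scipy.integrate.dblquad, whose integrand takes the inner variable as its first argument, with the same limits as above (outer variable y from 0 to 2, inner variable x from 0 to 2 − y); a Monte Carlo estimate is included as an independent cross-check. The function name f_XY and the sample size are choices made here for illustration.

```python
import numpy as np
from scipy.integrate import dblquad

def f_XY(x, y):
    """Uniform joint PDF on [0, 2]^2 (value 1/4 inside, 0 outside)."""
    inside = (0.0 <= x <= 2.0) and (0.0 <= y <= 2.0)
    return 0.25 if inside else 0.0

# dblquad calls f_XY(inner, outer): the outer variable is y in [0, 2],
# and the inner variable is x in [0, 2 - y].
P_B, abs_err = dblquad(f_XY, 0.0, 2.0, lambda y: 0.0, lambda y: 2.0 - y)

# Monte Carlo cross-check: draw (X, Y) uniformly on [0, 2]^2 and count.
rng = np.random.default_rng(0)
XY = rng.uniform(0.0, 2.0, size=(200_000, 2))
P_B_mc = np.mean(XY.sum(axis=1) <= 2.0)

print("P[B] by numerical integration:", P_B)
print("P[B] by Monte Carlo:", P_B_mc)
```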
Figure 5.6: To calculate $\mathbb{P}[X + Y \le 2]$, we perform a 2D integration over a triangle.


5.1.4 Normalization 


The normalization property of a two-dimensional PMF and PDF is the property that, when 
we enumerate all outcomes of the sample space, we obtain 1. 


Theorem 5.1. Let $\Omega = \Omega_X \times \Omega_Y$. All joint PMFs and joint PDFs satisfy

$$\sum_{(x,y) \in \Omega} p_{X,Y}(x,y) = 1 \qquad \text{or} \qquad \int_{\Omega} f_{X,Y}(x,y)\, dx\, dy = 1.$$

Example 5.6. Consider a joint uniform PDF defined in the shaded area $[0,3] \times [0,3]$
with PDF defined below. Find the constant $c$.

$$f_{X,Y}(x,y) = \begin{cases} c, & \text{if } (x,y) \in [0,3] \times [0,3], \\ 0, & \text{otherwise}. \end{cases}$$


Solution. To find the constant $c$, we note that

$$1 = \int_{0}^{3} \int_{0}^{3} f_{X,Y}(x,y)\, dx\, dy = \int_{0}^{3} \int_{0}^{3} c\, dx\, dy = 9c.$$

Equating the two sides gives us $c = \frac{1}{9}$.


Practice Exercise 5.2. Consider a joint PDF

$$f_{X,Y}(x,y) = \begin{cases} c\, e^{-x} e^{-y}, & 0 \le y \le x < \infty, \\ 0, & \text{otherwise}. \end{cases}$$

Find the constant $c$. Tip: Consider the area of integration as shown in Figure 5.7.


Solution. There are two ways to take the integration shown in Figure 5.7. We choose
the inner integration w.r.t. $y$ first.

$$\int_{\Omega} f_{X,Y}(x,y)\, dx\, dy = \int_{0}^{\infty} \int_{0}^{x} c\, e^{-x} e^{-y}\, dy\, dx = \int_{0}^{\infty} c\, e^{-x} (1 - e^{-x})\, dx = \frac{c}{2}.$$

Since the integral must equal 1, we have $c = 2$.
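The normalization constant can also be confirmed numerically. This is a minimal sketch, assuming scipy.integrate.dblquad is available; it integrates the unnormalized function over the triangle 0 ≤ y ≤ x < ∞ using the same order as the solution (inner integration over y, outer over x) and inverts the result. The function name unnormalized_pdf is hypothetical.

```python
import numpy as np
from scipy.integrate import dblquad

def unnormalized_pdf(y, x):
    """e^{-x} e^{-y}; dblquad passes the inner variable (here y) first."""
    return np.exp(-x) * np.exp(-y)

# Outer variable: x from 0 to infinity. Inner variable: y from 0 to x.
integral, abs_err = dblquad(unnormalized_pdf, 0.0, np.inf,
                            lambda x: 0.0, lambda x: x)
c = 1.0 / integral

print("integral of e^{-x} e^{-y} over the triangle:", integral)
print("normalization constant c:", c)
```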


Figure 5.7: To integrate the probability $\mathbb{P}[0 \le Y \le X]$, we perform a 2D integration over a triangle.
The two subfigures show the two ways of integrating the triangle. [Left] $\int dx$ first, and then $\int dy$.
[Right] $\int dy$ first, and then $\int dx$.


5.1.5 Marginal PMF and marginal PDF 


If we only sum / integrate for one random variable, we obtain the PMF / PDF of the other 
random variable. The resulting PMF / PDF is called the marginal PMF / PDF. 


Definition 5.3. The marginal PMF is defined as

$$p_X(x) = \sum_{y \in \Omega_Y} p_{X,Y}(x,y) \qquad \text{and} \qquad p_Y(y) = \sum_{x \in \Omega_X} p_{X,Y}(x,y),$$

and the marginal PDF is defined as

$$f_X(x) = \int_{\Omega_Y} f_{X,Y}(x,y)\, dy \qquad \text{and} \qquad f_Y(y) = \int_{\Omega_X} f_{X,Y}(x,y)\, dx.$$

Since $f_{X,Y}(x,y)$ is a two-dimensional function, when integrating over $y$ from $-\infty$ to $\infty$, we
project $f_{X,Y}(x,y)$ onto the x-axis. Therefore, the resulting function depends on $x$ only.
Example 5.7. Consider the joint PDF $f_{X,Y}(x,y) = \frac{1}{4}$ shown below. Find the marginal
PDFs.


Solution. If we integrate over x and y, we have 


1, wil<27 <2, 
2, TP AR SS oS}, 
1 if3<a<4, 
0 


i otherwise. 


ihc = 2, 
if2<a<3, and fy(y) = 


otherwise. 


So the marginal PDFs are the projection of the joint PDFs onto the x- and y-axes. 


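Practice Exercise 5.3. A joint Gaussian random variable $(X,Y)$ has a joint PDF
given by

$$f_{X,Y}(x,y) = \frac{1}{2\pi\sigma^2} \exp\left\{ -\frac{(x - \mu_X)^2 + (y - \mu_Y)^2}{2\sigma^2} \right\}.$$

Find the marginal PDFs $f_X(x)$ and $f_Y(y)$.


Solution.

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy = \int_{-\infty}^{\infty} \frac{1}{2\pi\sigma^2} \exp\left\{ -\frac{(x - \mu_X)^2 + (y - \mu_Y)^2}{2\sigma^2} \right\} dy$$

$$= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu_X)^2}{2\sigma^2} \right\} \cdot \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu_Y)^2}{2\sigma^2} \right\} dy.$$

Recognizing that the last integral is equal to unity because it integrates a Gaussian
PDF over the real line, it follows that

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu_X)^2}{2\sigma^2} \right\}.$$

Similarly, we have

$$f_Y(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu_Y)^2}{2\sigma^2} \right\}.$$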
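Marginalization is a projection, and it can be imitated on a grid by summing the joint PDF along one axis. The sketch below is an illustration under assumed parameter values (mu_X = 1, mu_Y = −2, sigma = 1.5 are chosen here, not taken from the text); it integrates the joint Gaussian over y with a simple Riemann sum and compares the result with the 1D Gaussian PDF.

```python
import numpy as np

mu_X, mu_Y, sigma = 1.0, -2.0, 1.5     # illustrative parameters (assumptions)

def joint_pdf(x, y):
    """Joint Gaussian PDF with a common variance sigma^2."""
    expo = -((x - mu_X) ** 2 + (y - mu_Y) ** 2) / (2.0 * sigma ** 2)
    return np.exp(expo) / (2.0 * np.pi * sigma ** 2)

def gaussian_pdf(t, mu):
    """1D Gaussian PDF with mean mu and variance sigma^2."""
    return np.exp(-(t - mu) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

x = np.linspace(mu_X - 3.0, mu_X + 3.0, 13)                    # points where f_X is evaluated
y = np.linspace(mu_Y - 10.0 * sigma, mu_Y + 10.0 * sigma, 4001)  # integration grid over y
dy = y[1] - y[0]

# Riemann sum over y: f_X(x) is approximately sum_j f_{X,Y}(x, y_j) * dy.
f_X_numeric = joint_pdf(x[:, None], y[None, :]).sum(axis=1) * dy
f_X_formula = gaussian_pdf(x, mu_X)

print("max |numeric - formula| =", np.max(np.abs(f_X_numeric - f_X_formula)))
```

Summing along the other axis (over x) would give the marginal $f_Y(y)$ in the same way.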
5.1.6 Independent random variables


Two random variables are said to be independent if and only if the joint PMF or PDF can 
be factorized as a product of the marginal PMF / PDFs. 


Definition 5.4. Random variables $X$ and $Y$ are independent if and only if

$$p_{X,Y}(x,y) = p_X(x)\, p_Y(y), \qquad \text{or} \qquad f_{X,Y}(x,y) = f_X(x)\, f_Y(y).$$

This definition is consistent with the definition of independence of two events. Recall that
two events $A$ and $B$ are independent if and only if $\mathbb{P}[A \cap B] = \mathbb{P}[A]\mathbb{P}[B]$. Letting $A = \{X = x\}$
and $B = \{Y = y\}$, we see that if $A$ and $B$ are independent then $\mathbb{P}[X = x \cap Y = y]$ is the
product $\mathbb{P}[X = x]\mathbb{P}[Y = y]$. This is precisely the relationship $p_{X,Y}(x,y) = p_X(x)\, p_Y(y)$.


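Example 5.8. Consider two random variables with a joint PDF given by

\[
f_{X,Y}(x,y) = \frac{1}{2\pi\sigma^2} \exp\left\{ -\frac{(x-\mu_X)^2 + (y-\mu_Y)^2}{2\sigma^2} \right\}.
\]

Are X and Y independent?

Solution. We know that

\[
f_{X,Y}(x,y) = \underbrace{\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu_X)^2}{2\sigma^2} \right\}}_{f_X(x)}
\times \underbrace{\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y-\mu_Y)^2}{2\sigma^2} \right\}}_{f_Y(y)}.
\]

Therefore, the random variables X and Y are independent.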


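Practice Exercise 5.4. Let X be a coin and Y be a die. Then the joint PMF is given
by the table below.

\[
\begin{array}{c|cccccc}
      & Y=1 & Y=2 & Y=3 & Y=4 & Y=5 & Y=6 \\ \hline
X = 0 & \frac{1}{12} & \frac{1}{12} & \frac{1}{12} & \frac{1}{12} & \frac{1}{12} & \frac{1}{12} \\
X = 1 & \frac{1}{12} & \frac{1}{12} & \frac{1}{12} & \frac{1}{12} & \frac{1}{12} & \frac{1}{12}
\end{array}
\]

Are X and Y independent?

Solution. For any x and y, we have that

\[
p_{X,Y}(x,y) = \frac{1}{12} = \underbrace{\frac{1}{2}}_{p_X(x)} \times \underbrace{\frac{1}{6}}_{p_Y(y)}.
\]

Therefore, the random variables X and Y are independent.


Example 5.9. Consider two random variables X and Y with a joint PDF given by*

\[
\begin{aligned}
f_{X,Y}(x,y) &\propto \exp\{-(x-y)^2\} = \exp\{-x^2 + 2xy - y^2\} \\
&= \underbrace{\exp\{-x^2\}}_{f_X(x)} \; \underbrace{\exp\{2xy\}}_{\text{extra term}} \; \underbrace{\exp\{-y^2\}}_{f_Y(y)}.
\end{aligned}
\]

This PDF cannot be factorized into a product of two marginal PDFs. Therefore, the
random variables are dependent.

*We use the notation "$\propto$" to denote "proportional to". It implies that the normalization constant
is omitted.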
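
For discrete tables, the factorization test of Definition 5.4 can be checked entry by entry. The sketch below does this for the coin-and-die table and for a second, deliberately dependent table; the helper `is_independent` and the table `copy_table` are illustrative assumptions, not part of the text.

```python
from fractions import Fraction as F

def is_independent(joint):
    """Check p_{X,Y}(x, y) == p_X(x) * p_Y(y) for every entry of the table."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return all(
        joint[i][j] == p_x[i] * p_y[j]
        for i in range(len(p_x))
        for j in range(len(p_y))
    )

# Fair coin (2 states) and fair die (6 states): every entry equals 1/12.
coin_die = [[F(1, 12)] * 6 for _ in range(2)]

# Hypothetical dependent pair: Y is an exact copy of a fair coin X.
copy_table = [[F(1, 2), F(0)], [F(0), F(1, 2)]]

print(is_independent(coin_die))    # True
print(is_independent(copy_table))  # False
```

Note that a single failing entry is enough to rule out independence, whereas independence requires the factorization at every pair of states.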


We can extrapolate the definition of independence to multiple random variables. If
there are many random variables $X_1, X_2, \ldots, X_N$, they will have a joint PDF

\[
f_{X_1,\ldots,X_N}(x_1,\ldots,x_N).
\]

If these random variables $X_1, X_2, \ldots, X_N$ are independent, then the joint PDF can be
factorized as

\[
\begin{aligned}
f_{X_1,\ldots,X_N}(x_1,\ldots,x_N) &= f_{X_1}(x_1)\, f_{X_2}(x_2) \cdots f_{X_N}(x_N) \\
&= \prod_{n=1}^{N} f_{X_n}(x_n).
\end{aligned}
\]

This gives us the definition of independence for N random variables.


Definition 5.5. A sequence of random variables $X_1, \ldots, X_N$ is independent if and
only if their joint PDF (or joint PMF) can be factorized:

\[
f_{X_1,\ldots,X_N}(x_1,\ldots,x_N) = \prod_{n=1}^{N} f_{X_n}(x_n). \tag{5.6}
\]


Example 5.10. Throw a die 4 times. Let $X_1$, $X_2$, $X_3$ and $X_4$ be the outcomes. Then,
since these four throws are independent, the probability mass function of any quadruple
$(x_1, x_2, x_3, x_4)$ is

\[
p_{X_1,X_2,X_3,X_4}(x_1,x_2,x_3,x_4) = p_{X_1}(x_1)\, p_{X_2}(x_2)\, p_{X_3}(x_3)\, p_{X_4}(x_4).
\]

For example, the probability of getting (1, 5, 2, 6) is

\[
p_{X_1,X_2,X_3,X_4}(1,5,2,6) = p_{X_1}(1)\, p_{X_2}(5)\, p_{X_3}(2)\, p_{X_4}(6) = \left(\frac{1}{6}\right)^4.
\]


The example above demonstrates an interesting phenomenon. If the N random variables
are independent, and if they all have the same distribution, then the joint PDF/PMF
is a product of N copies of the same individual PDF/PMF, each evaluated at its own
argument. Random variables satisfying this property are known as independent and
identically distributed random variables.

Definition 5.6 (Independent and Identically Distributed (i.i.d.)). A collection of
random variables $X_1, \ldots, X_N$ is called independent and identically distributed (i.i.d.)
if

• All $X_1, \ldots, X_N$ are independent; and
• All $X_1, \ldots, X_N$ have the same distribution, i.e., $f_{X_1}(x) = \cdots = f_{X_N}(x)$.


If $X_1, \ldots, X_N$ are i.i.d., we have that

\[
f_{X_1,\ldots,X_N}(x_1,\ldots,x_N) = \prod_{n=1}^{N} f_{X_1}(x_n),
\]

where the particular choice of $X_1$ is unimportant because $f_{X_1}(x) = \cdots = f_{X_N}(x)$.


Why is i.i.d. so important?

• If a set of random variables are i.i.d., then the joint PDF can be written as a
product of PDFs.

• Integrating a joint PDF is difficult. Integrating a product of PDFs is much easier.


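Example 5.11. Let $X_1, X_2, \ldots, X_N$ be a sequence of i.i.d. Gaussian random variables
where each $X_i$ has a PDF

\[
f_{X_i}(x) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{x^2}{2} \right\}.
\]

The joint PDF of $X_1, X_2, \ldots, X_N$ is

\[
f_{X_1,\ldots,X_N}(x_1,\ldots,x_N) = \prod_{i=1}^{N} \left\{ \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{x_i^2}{2} \right\} \right\}
= \left( \frac{1}{\sqrt{2\pi}} \right)^{N} \exp\left\{ -\sum_{i=1}^{N} \frac{x_i^2}{2} \right\},
\]

which is a function depending not on the individual values of $x_1, x_2, \ldots, x_N$ but on the
sum $\sum_{i=1}^{N} x_i^2$. So we have "compressed" an N-dimensional function into a 1D function.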


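Example 5.12. Let $\theta$ be a deterministic number that was sent through a noisy channel.
We model the noise as an additive Gaussian random variable with mean 0 and variance
$\sigma^2$. Supposing we have observed measurements $X_i = \theta + W_i$, for $i = 1, \ldots, N$, where
$W_i \sim \text{Gaussian}(0, \sigma^2)$, then the PDF of each $X_i$ is

\[
f_{X_i}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\theta)^2}{2\sigma^2} \right\}.
\]

Thus the joint PDF of $(X_1, X_2, \ldots, X_N)$, with the noise terms $W_i$ taken to be independent, is

\[
f_{X_1,\ldots,X_N}(x_1,\ldots,x_N) = \prod_{i=1}^{N} \left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x_i-\theta)^2}{2\sigma^2} \right\} \right\}
= \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \exp\left\{ -\sum_{i=1}^{N} \frac{(x_i-\theta)^2}{2\sigma^2} \right\}.
\]

Essentially, this joint PDF tells us the probability density of seeing sample data $x_1, \ldots, x_N$.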
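
Because an i.i.d. joint PDF is a product, its logarithm is a sum of one-dimensional terms, which is how such densities are usually evaluated on a computer. The sketch below compares the term-by-term sum with the closed form for Gaussian samples; the function names, the sample values, and the parameter choices are assumptions made for illustration.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian(mean, var) density evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def iid_log_joint(samples, theta, var):
    """Log joint density of i.i.d. Gaussian(theta, var) samples: a sum of N log-densities."""
    return sum(gaussian_log_pdf(x, theta, var) for x in samples)

def closed_form_log_joint(samples, theta, var):
    """Same quantity from (2*pi*var)^(-N/2) * exp(-sum (x_i - theta)^2 / (2*var))."""
    n = len(samples)
    sq = sum((x - theta) ** 2 for x in samples)
    return -0.5 * n * math.log(2 * math.pi * var) - sq / (2 * var)

samples = [1.2, 0.7, 1.9, 1.1]  # hypothetical measurements
a = iid_log_joint(samples, theta=1.0, var=0.25)
b = closed_form_log_joint(samples, theta=1.0, var=0.25)
print(a, b)  # the two values agree up to rounding
```

Working with the log of the joint PDF avoids numerical underflow when N is large, since a product of many small densities quickly falls below floating-point range.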


5.1.7 Joint CDF 


We now introduce the cumulative distribution function (CDF) for multiple variables. 


Definition 5.7. Let X and Y be two random variables. The joint CDF of X and Y
is the function $F_{X,Y}(x,y)$ such that

\[
F_{X,Y}(x,y) = \mathbb{P}[X \le x \cap Y \le y]. \tag{5.7}
\]

This definition can be more explicitly written as follows.


Definition 5.8. If X and Y are discrete, then

\[
F_{X,Y}(x,y) = \sum_{y' \le y} \sum_{x' \le x} p_{X,Y}(x',y').
\]

If X and Y are continuous, then

\[
F_{X,Y}(x,y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(x',y')\, dx'\, dy'.
\]

If the two random variables are independent, then we have

\[
F_{X,Y}(x,y) = \int_{-\infty}^{x} f_X(x')\, dx' \int_{-\infty}^{y} f_Y(y')\, dy' = F_X(x)\, F_Y(y).
\]


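Example 5.13. Let X and Y be two independent uniform random variables
Uniform(0, 1). Find the joint CDF.

Solution. For $0 \le x \le 1$ and $0 \le y \le 1$,

\[
F_{X,Y}(x,y) = F_X(x)\, F_Y(y) = \int_{0}^{x} f_X(x')\, dx' \int_{0}^{y} f_Y(y')\, dy' = \int_{0}^{x} 1\, dx' \int_{0}^{y} 1\, dy' = xy.
\]

Practice Exercise 5.5. Let X and Y be two independent Gaussian random variables
Gaussian($\mu$, $\sigma^2$). Find the joint CDF.

Solution. Let $\Phi(\cdot)$ be the CDF of the standard Gaussian.

\[
\begin{aligned}
F_{X,Y}(x,y) &= F_X(x)\, F_Y(y) \\
&= \int_{-\infty}^{x} f_X(x')\, dx' \int_{-\infty}^{y} f_Y(y')\, dy' = \Phi\left( \frac{x-\mu}{\sigma} \right) \Phi\left( \frac{y-\mu}{\sigma} \right).
\end{aligned}
\]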
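Here are a few properties of the CDF:

\[
\begin{aligned}
F_{X,Y}(x, -\infty) &= \int_{-\infty}^{x} \int_{-\infty}^{-\infty} f_{X,Y}(x',y')\, dy'\, dx' = \int_{-\infty}^{x} 0\, dx' = 0, \\
F_{X,Y}(-\infty, y) &= \int_{-\infty}^{y} \int_{-\infty}^{-\infty} f_{X,Y}(x',y')\, dx'\, dy' = \int_{-\infty}^{y} 0\, dy' = 0, \\
F_{X,Y}(-\infty, -\infty) &= \int_{-\infty}^{-\infty} \int_{-\infty}^{-\infty} f_{X,Y}(x',y')\, dx'\, dy' = 0, \\
F_{X,Y}(\infty, \infty) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x',y')\, dx'\, dy' = 1.
\end{aligned}
\]

In addition, we can obtain the marginal CDF as follows.


Proposition 5.1. Let X and Y be two random variables. The marginal CDF is

\[
F_X(x) = F_{X,Y}(x, \infty), \qquad F_Y(y) = F_{X,Y}(\infty, y).
\]

Proof. We prove only the first case. The second case is similar.

\[
F_{X,Y}(x, \infty) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f_{X,Y}(x',y')\, dy'\, dx' = \int_{-\infty}^{x} f_X(x')\, dx' = F_X(x).
\]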


By the fundamental theorem of calculus, we can derive the PDF from the CDF.


Definition 5.9. Let $F_{X,Y}(x,y)$ be the joint CDF of X and Y. Then, the joint PDF
is

\[
f_{X,Y}(x,y) = \frac{\partial^2}{\partial y\, \partial x} F_{X,Y}(x,y). \tag{5.12}
\]

The order of the partial derivatives can be switched, yielding a symmetric result:

\[
f_{X,Y}(x,y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x,y).
\]

Example 5.14. Let X and Y be two uniform random variables with joint CDF
$F_{X,Y}(x,y) = xy$ for $0 \le x \le 1$ and $0 \le y \le 1$. Find the joint PDF.

Solution.

\[
f_{X,Y}(x,y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x,y) = \frac{\partial^2}{\partial x\, \partial y}\, xy = 1,
\]

which is consistent with the definition of a joint uniform random variable.


Practice Exercise 5.6. Let X and Y be two exponential random variables with joint
CDF

\[
F_{X,Y}(x,y) = (1 - e^{-\lambda x})(1 - e^{-\lambda y}), \quad x \ge 0, \; y \ge 0.
\]

Find the joint PDF.

Solution.

\[
f_{X,Y}(x,y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x,y) = \frac{\partial^2}{\partial x\, \partial y} (1 - e^{-\lambda x})(1 - e^{-\lambda y}) = \lambda e^{-\lambda x} \cdot \lambda e^{-\lambda y},
\]

which is consistent with the definition of a joint exponential random variable.
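
The relation between the joint CDF and the joint PDF can be sanity-checked numerically by approximating the mixed partial derivative with finite differences. The sketch below does this for a product of two exponential CDFs; the rate `LAM`, the evaluation point, and the step size are arbitrary assumptions chosen for illustration.

```python
import math

LAM = 2.0  # hypothetical rate parameter

def joint_cdf(x, y, lam=LAM):
    """F_{X,Y}(x, y) = (1 - e^{-lam x})(1 - e^{-lam y}) for x, y >= 0."""
    return (1 - math.exp(-lam * x)) * (1 - math.exp(-lam * y))

def mixed_partial(f, x, y, h=1e-4):
    """Central-difference estimate of d^2 f / (dx dy) at (x, y)."""
    return (f(x + h, y + h) - f(x + h, y - h)
            - f(x - h, y + h) + f(x - h, y - h)) / (4 * h * h)

x0, y0 = 0.3, 0.8
numeric = mixed_partial(joint_cdf, x0, y0)
exact = LAM * math.exp(-LAM * x0) * LAM * math.exp(-LAM * y0)
print(numeric, exact)  # the finite-difference value is close to the product of exponential PDFs
```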


5.2 Joint Expectation 


5.2.1 Definition and interpretation 


When we have a single random variable, the expectation is defined as

\[
\mathbb{E}[X] = \int_{\Omega} x f_X(x)\, dx.
\]

For a pair of random variables, what would be a good way of defining the expectation?
Certainly, we cannot just replace $f_X(x)$ by $f_{X,Y}(x,y)$ because the integration has to become
a double integration. However, if it is a double integration, where should we put the
variable $y$? It turns out that a useful way of defining the expectation for X and Y is as
follows.


Definition 5.10. Let X and Y be two random variables. The joint expectation is

\[
\mathbb{E}[XY] = \sum_{y \in \Omega_Y} \sum_{x \in \Omega_X} xy \cdot p_{X,Y}(x,y) \tag{5.13}
\]

if X and Y are discrete, or

\[
\mathbb{E}[XY] = \int_{y \in \Omega_Y} \int_{x \in \Omega_X} xy \cdot f_{X,Y}(x,y)\, dx\, dy \tag{5.14}
\]

if X and Y are continuous. Joint expectation is also called correlation.


The double summation and integration on the right-hand side of the equation is nothing
but the state times the probability. Here, the state is the product $xy$, and the probability is
the joint PMF $p_{X,Y}(x,y)$ (or PDF). Therefore, as long as you agree that joint expectation
should be defined as $\mathbb{E}[XY]$, the double summation and the double integration make sense.

The biggest mystery here is $\mathbb{E}[XY]$. You may wonder why the joint expectation should
be defined as the expectation of the product $\mathbb{E}[XY]$. Why not the sum $\mathbb{E}[X + Y]$, or the
difference $\mathbb{E}[X - Y]$, or the quotient $\mathbb{E}[X/Y]$? Why are we so deeply interested in X times Y?
These are excellent questions. That the joint expectation is defined as the product has to do
with the correlation between two random variables. We will take a small detour into linear
algebra.

Let us consider two discrete random variables X and Y, both with N states. So X
will take the states $\{x_1, x_2, \ldots, x_N\}$ and Y will take the states $\{y_1, y_2, \ldots, y_N\}$. Let's define
them as two vectors: $\boldsymbol{x} \overset{\text{def}}{=} [x_1, \ldots, x_N]^T$ and $\boldsymbol{y} \overset{\text{def}}{=} [y_1, \ldots, y_N]^T$. Since X and Y are random
variables, they have a joint PMF $p_{X,Y}(x,y)$. The array of the PMF values can be written
as a matrix:

\[
\text{PMF as a matrix} = \boldsymbol{P} \overset{\text{def}}{=}
\begin{bmatrix}
p_{X,Y}(x_1,y_1) & p_{X,Y}(x_1,y_2) & \cdots & p_{X,Y}(x_1,y_N) \\
p_{X,Y}(x_2,y_1) & p_{X,Y}(x_2,y_2) & \cdots & p_{X,Y}(x_2,y_N) \\
\vdots & \vdots & \ddots & \vdots \\
p_{X,Y}(x_N,y_1) & p_{X,Y}(x_N,y_2) & \cdots & p_{X,Y}(x_N,y_N)
\end{bmatrix}.
\]

Let's try to write the joint expectation in terms of matrices and vectors. The definition
of a joint expectation tells us that

\[
\mathbb{E}[XY] = \sum_{i=1}^{N} \sum_{j=1}^{N} x_i y_j \cdot p_{X,Y}(x_i, y_j),
\]

which can be written as

\[
\mathbb{E}[XY] =
\underbrace{\begin{bmatrix} x_1 & \cdots & x_N \end{bmatrix}}_{\boldsymbol{x}^T}
\underbrace{\begin{bmatrix}
p_{X,Y}(x_1,y_1) & \cdots & p_{X,Y}(x_1,y_N) \\
\vdots & \ddots & \vdots \\
p_{X,Y}(x_N,y_1) & \cdots & p_{X,Y}(x_N,y_N)
\end{bmatrix}}_{\boldsymbol{P}}
\underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}}_{\boldsymbol{y}}
= \boldsymbol{x}^T \boldsymbol{P} \boldsymbol{y}.
\]

This is a weighted inner product between $\boldsymbol{x}$ and $\boldsymbol{y}$ using the weight matrix $\boldsymbol{P}$.


Why correlation is defined as $\mathbb{E}[XY]$

• $\mathbb{E}[XY]$ is a weighted inner product between the states:

\[
\mathbb{E}[XY] = \boldsymbol{x}^T \boldsymbol{P} \boldsymbol{y}.
\]

• $\boldsymbol{x}$ and $\boldsymbol{y}$ are the states of the random variables X and Y.

• The inner product measures the similarity between two vectors.
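
The equivalence between the double summation and the weighted inner product can be verified on a small table. In the sketch below, the states, the joint PMF matrix `P`, and both helper functions are hypothetical and chosen only for illustration.

```python
from fractions import Fraction as F

states_x = [1, 2, 3]  # hypothetical states of X
states_y = [0, 1, 2]  # hypothetical states of Y
# Hypothetical joint PMF matrix with P[i][j] = p_{X,Y}(x_i, y_j).
P = [
    [F(1, 6), F(1, 12), F(0)],
    [F(1, 12), F(1, 3), F(1, 12)],
    [F(0), F(1, 12), F(1, 6)],
]

def double_sum(x, P, y):
    """E[XY] from the definition: sum_i sum_j x_i y_j p(x_i, y_j)."""
    return sum(x[i] * y[j] * P[i][j] for i in range(len(x)) for j in range(len(y)))

def weighted_inner_product(x, P, y):
    """E[XY] as x^T (P y): a matrix-vector product followed by an inner product."""
    Py = [sum(P[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
    return sum(x[i] * Py[i] for i in range(len(x)))

e_xy = double_sum(states_x, P, states_y)
print(e_xy, weighted_inner_product(states_x, P, states_y))  # both equal 7/3
```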


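Example 5.15. Let X be a discrete random variable with N states, where each state
has an equal probability. Thus, $p_X(x) = 1/N$ for all $x$. Let $Y = X$ be another variable.
Then the joint PMF of $(X,Y)$ is

\[
p_{X,Y}(x,y) = \begin{cases} \frac{1}{N}, & x = y, \\ 0, & x \ne y. \end{cases}
\]

It follows that the joint expectation is

\[
\mathbb{E}[XY] = \sum_{i=1}^{N} \sum_{j=1}^{N} x_i y_j \cdot p_{X,Y}(x_i, y_j) = \frac{1}{N} \sum_{i=1}^{N} x_i y_i.
\]

Equivalently, we can obtain the result via the inner product by defining

\[
\boldsymbol{P} = \begin{bmatrix} \frac{1}{N} & 0 & \cdots & 0 \\ 0 & \frac{1}{N} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{N} \end{bmatrix} = \frac{1}{N} \boldsymbol{I}.
\]

In this case, the weighted inner product is

\[
\boldsymbol{x}^T \boldsymbol{P} \boldsymbol{y} = \frac{\boldsymbol{x}^T \boldsymbol{y}}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i y_i = \mathbb{E}[XY].
\]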

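How do we understand the inner product? Ignoring the matrix $\boldsymbol{P}$ for a moment, we
recall an elementary result in linear algebra.


Definition 5.11. Let $\boldsymbol{x} \in \mathbb{R}^N$ and $\boldsymbol{y} \in \mathbb{R}^N$ be two vectors. Define the cosine angle
$\cos\theta$ as

\[
\cos\theta = \frac{\boldsymbol{x}^T \boldsymbol{y}}{\|\boldsymbol{x}\|\, \|\boldsymbol{y}\|}, \tag{5.15}
\]

where $\|\boldsymbol{x}\| = \sqrt{\sum_{i=1}^{N} x_i^2}$ is the norm of the vector $\boldsymbol{x}$, and $\|\boldsymbol{y}\| = \sqrt{\sum_{i=1}^{N} y_i^2}$ is the
norm of the vector $\boldsymbol{y}$.


This definition can be understood as the geometry between two vectors, as illustrated in
Figure 5.8. If the two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ are parallel so that $\boldsymbol{x} = \alpha \boldsymbol{y}$ for some $\alpha > 0$, then the
angle $\theta = 0$. If $\boldsymbol{x}$ and $\boldsymbol{y}$ are orthogonal so that $\boldsymbol{x}^T \boldsymbol{y} = 0$, then $\theta = \pi/2$. Therefore, the inner
product $\boldsymbol{x}^T \boldsymbol{y}$ tells us the degree of correlation between the vectors $\boldsymbol{x}$ and $\boldsymbol{y}$.

[Figure 5.8 diagram: X with states $\boldsymbol{x} = [x_1, \ldots, x_N]^T$ and distribution $p_X(x)$ or $\boldsymbol{P}_X$; Y with states $\boldsymbol{y} = [y_1, \ldots, y_N]^T$ and distribution $p_Y(y)$ or $\boldsymbol{P}_Y$; joint distribution $p_{X,Y}(x,y)$ or $\boldsymbol{P}_{XY}$; angle given by $\cos\theta = \mathbb{E}[XY] / \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]} = \boldsymbol{x}^T \boldsymbol{P}_{XY} \boldsymbol{y} / (\|\boldsymbol{x}\|_{\boldsymbol{P}_X} \|\boldsymbol{y}\|_{\boldsymbol{P}_Y})$.]

Figure 5.8: The geometry of joint expectation. $\mathbb{E}[XY]$ gives us the cosine angle between the two random
variables. This, in turn, tells us the correlation between the two random variables.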


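Now let's come back to our discussion about the joint expectation. The cosine angle
definition tells us that if $\mathbb{E}[XY] = \boldsymbol{x}^T \boldsymbol{P} \boldsymbol{y}$, the following form would make sense:

\[
\cos\theta = \frac{\boldsymbol{x}^T \boldsymbol{P} \boldsymbol{y}}{\|\boldsymbol{x}\|\, \|\boldsymbol{y}\|} = \frac{\mathbb{E}[XY]}{\|\boldsymbol{x}\|\, \|\boldsymbol{y}\|}.
\]

That is, as long as we can find out the norms $\|\boldsymbol{x}\|$ and $\|\boldsymbol{y}\|$, we will be able to interpret
$\mathbb{E}[XY]$ from the cosine angle perspective. But what would be a reasonable definition of $\|\boldsymbol{x}\|$
and $\|\boldsymbol{y}\|$? We define the norm by first considering the second moments of the random variables X
and Y:

\[
\begin{aligned}
\mathbb{E}[X^2] &= \sum_{i=1}^{N} x_i x_i \cdot p_X(x_i) \\
&= \underbrace{\begin{bmatrix} x_1 & \cdots & x_N \end{bmatrix}}_{\boldsymbol{x}^T}
\underbrace{\begin{bmatrix} p_X(x_1) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & p_X(x_N) \end{bmatrix}}_{\boldsymbol{P}_X}
\underbrace{\begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}}_{\boldsymbol{x}} \\
&= \boldsymbol{x}^T \boldsymbol{P}_X \boldsymbol{x} = \|\boldsymbol{x}\|_{\boldsymbol{P}_X}^2,
\end{aligned}
\]

where $\boldsymbol{P}_X$ is the diagonal matrix storing the probability masses of the random variable X.
It is not difficult to show that $\boldsymbol{P}_X = \text{diag}(\boldsymbol{P}\boldsymbol{1})$, where $\boldsymbol{1}$ is the all-ones vector, by following the definition of the marginal
distributions (which are the column and row sums of the joint PMF). Similarly we can define

\[
\begin{aligned}
\mathbb{E}[Y^2] &= \sum_{j=1}^{N} y_j y_j \cdot p_Y(y_j) \\
&= \underbrace{\begin{bmatrix} y_1 & \cdots & y_N \end{bmatrix}}_{\boldsymbol{y}^T}
\underbrace{\begin{bmatrix} p_Y(y_1) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & p_Y(y_N) \end{bmatrix}}_{\boldsymbol{P}_Y}
\underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}}_{\boldsymbol{y}} \\
&= \boldsymbol{y}^T \boldsymbol{P}_Y \boldsymbol{y} = \|\boldsymbol{y}\|_{\boldsymbol{P}_Y}^2.
\end{aligned}
\]

Therefore, one way to define the cosine angle is to start with

\[
\cos\theta = \frac{\boldsymbol{x}^T \boldsymbol{P}_{XY} \boldsymbol{y}}{\|\boldsymbol{x}\|_{\boldsymbol{P}_X} \|\boldsymbol{y}\|_{\boldsymbol{P}_Y}},
\]

where $\boldsymbol{P}_{XY} = \boldsymbol{P}$, $\|\boldsymbol{x}\|_{\boldsymbol{P}_X} = \sqrt{\boldsymbol{x}^T \boldsymbol{P}_X \boldsymbol{x}}$ and $\|\boldsymbol{y}\|_{\boldsymbol{P}_Y} = \sqrt{\boldsymbol{y}^T \boldsymbol{P}_Y \boldsymbol{y}}$. But writing it in terms of
the expectation, we observe that this cosine angle is exactly

\[
\cos\theta = \frac{\boldsymbol{x}^T \boldsymbol{P}_{XY} \boldsymbol{y}}{\|\boldsymbol{x}\|_{\boldsymbol{P}_X} \|\boldsymbol{y}\|_{\boldsymbol{P}_Y}} = \frac{\mathbb{E}[XY]}{\sqrt{\mathbb{E}[X^2]}\, \sqrt{\mathbb{E}[Y^2]}}.
\]

Therefore, $\mathbb{E}[XY]$ defines the cosine angle between the two random variables, which, in turn,
defines the correlation between the two. A large $|\mathbb{E}[XY]|$ means that X and Y are highly
correlated, and a small $|\mathbb{E}[XY]|$ means that X and Y are not very correlated. If $\mathbb{E}[XY] = 0$,
then the two random variables are uncorrelated. Therefore, $\mathbb{E}[XY]$ tells us how the two
random variables are related to each other.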


To further convince you that $\frac{\mathbb{E}[XY]}{\sqrt{\mathbb{E}[X^2]}\sqrt{\mathbb{E}[Y^2]}}$ can be interpreted as a cosine angle, we
show that

\[
-1 \le \frac{\mathbb{E}[XY]}{\sqrt{\mathbb{E}[X^2]}\, \sqrt{\mathbb{E}[Y^2]}} \le 1,
\]

because if this ratio can go beyond +1 and −1, it makes no sense to call it a cosine angle.

The argument follows from a very well-known inequality in probability, called the Cauchy-Schwarz
inequality (for expectation), which states that $-1 \le \frac{\mathbb{E}[XY]}{\sqrt{\mathbb{E}[X^2]}\sqrt{\mathbb{E}[Y^2]}} \le 1$:


Theorem 5.2 (Cauchy-Schwarz inequality). For any random variables X and Y,

\[
(\mathbb{E}[XY])^2 \le \mathbb{E}[X^2]\, \mathbb{E}[Y^2]. \tag{5.16}
\]


The following proof can be skipped if you are reading the book the first time.


Proof. Let $t \in \mathbb{R}$ be a constant. Consider

\[
\mathbb{E}[(X + tY)^2] = \mathbb{E}[X^2 + 2tXY + t^2 Y^2].
\]

Since $\mathbb{E}[(X + tY)^2] \ge 0$ for any $t$, it follows that

\[
\mathbb{E}[X^2 + 2tXY + t^2 Y^2] \ge 0.
\]

Expanding the left-hand side yields

\[
t^2\, \mathbb{E}[Y^2] + 2t\, \mathbb{E}[XY] + \mathbb{E}[X^2] \ge 0.
\]

The left-hand side is a quadratic function of $t$ that is nonnegative for every $t$. A quadratic
$at^2 + bt + c$ with $a > 0$ satisfies $at^2 + bt + c \ge 0$ for all $t$ only if $b^2 - 4ac \le 0$. Therefore, in our case, we have that

\[
(2\,\mathbb{E}[XY])^2 - 4\, \mathbb{E}[Y^2]\, \mathbb{E}[X^2] \le 0,
\]

which means $(\mathbb{E}[XY])^2 \le \mathbb{E}[X^2]\, \mathbb{E}[Y^2]$. The equality holds when $\mathbb{E}[(X + tY)^2] = 0$. In this
case, $X = -tY$ for some $t$, i.e., the random variable X is a scaled version of Y so that the
vector formed by the states of X is parallel to that of Y.


End of the proof.
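
As an informal check of the Cauchy-Schwarz bound, one can draw many random joint PMFs and confirm that the normalized correlation never leaves the interval [−1, 1]. The sketch below does so; the helper `normalized_correlation`, the number of states, the state range, and the seed are assumptions made purely for illustration, and a numerical experiment of this kind is of course not a proof.

```python
import math
import random

def normalized_correlation(xs, ys, P):
    """E[XY] / sqrt(E[X^2] E[Y^2]) for states xs, ys and joint PMF matrix P."""
    n, m = len(xs), len(ys)
    e_xy = sum(xs[i] * ys[j] * P[i][j] for i in range(n) for j in range(m))
    e_x2 = sum(xs[i] ** 2 * sum(P[i]) for i in range(n))                       # uses p_X(x_i)
    e_y2 = sum(ys[j] ** 2 * sum(P[i][j] for i in range(n)) for j in range(m))  # uses p_Y(y_j)
    return e_xy / math.sqrt(e_x2 * e_y2)

rng = random.Random(0)
ratios = []
for _ in range(1000):
    xs = [rng.uniform(-3, 3) for _ in range(4)]
    ys = [rng.uniform(-3, 3) for _ in range(4)]
    w = [[rng.random() for _ in range(4)] for _ in range(4)]
    total = sum(map(sum, w))
    P = [[v / total for v in row] for row in w]  # random joint PMF
    ratios.append(normalized_correlation(xs, ys, P))

print(min(ratios), max(ratios))  # both stay inside [-1, 1]
```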


5.2.2 Covariance and correlation coefficient 


In many practical problems, we prefer to work with central moments, i.e., $\mathbb{E}[(X - \mu_X)^2]$ instead
of $\mathbb{E}[X^2]$. This essentially means that we subtract the mean from the random variable.
If we adopt such a centralized random variable, we can define the covariance as follows.


Definition 5.12. Let X and Y be two random variables. Then the covariance of X
and Y is

\[
\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)], \tag{5.17}
\]

where $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$.


It is easy to show that if X = Y, then the covariance simplifies to the variance:

\[
\text{Cov}(X, X) = \mathbb{E}[(X - \mu_X)(X - \mu_X)] = \text{Var}[X].
\]


Thus, covariance is a generalization of variance. The former can handle a pair of variables,
whereas the latter is only for a single variable. We can also demonstrate the following result.


Theorem 5.3. Let X and Y be two random variables. Then

\[
\text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\, \mathbb{E}[Y].
\]

Proof. Just apply the definition of covariance:

\[
\begin{aligned}
\text{Cov}(X, Y) &= \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] \\
&= \mathbb{E}[XY - X\mu_Y - Y\mu_X + \mu_X \mu_Y] = \mathbb{E}[XY] - \mu_X \mu_Y.
\end{aligned}
\]


The next theorem concerns the sum of two random variables. 


Theorem 5.4. For any X and Y,

a. $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$.

b. $\text{Var}[X + Y] = \text{Var}[X] + 2\,\text{Cov}(X, Y) + \text{Var}[Y]$.




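Proof. Recall the definition of joint expectation:

\[
\begin{aligned}
\mathbb{E}[X + Y] &= \sum_{y} \sum_{x} (x + y)\, p_{X,Y}(x,y) \\
&= \sum_{y} \sum_{x} x\, p_{X,Y}(x,y) + \sum_{y} \sum_{x} y\, p_{X,Y}(x,y) \\
&= \sum_{x} x \left( \sum_{y} p_{X,Y}(x,y) \right) + \sum_{y} y \left( \sum_{x} p_{X,Y}(x,y) \right) \\
&= \sum_{x} x\, p_X(x) + \sum_{y} y\, p_Y(y) \\
&= \mathbb{E}[X] + \mathbb{E}[Y].
\end{aligned}
\]

Similarly,

\[
\begin{aligned}
\text{Var}[X + Y] &= \mathbb{E}[(X + Y)^2] - \mathbb{E}[X + Y]^2 \\
&= \mathbb{E}[(X + Y)^2] - (\mu_X + \mu_Y)^2 \\
&= \mathbb{E}[X^2 + 2XY + Y^2] - (\mu_X^2 + 2\mu_X \mu_Y + \mu_Y^2) \\
&= \mathbb{E}[X^2] - \mu_X^2 + \mathbb{E}[Y^2] - \mu_Y^2 + 2(\mathbb{E}[XY] - \mu_X \mu_Y) \\
&= \text{Var}[X] + 2\,\text{Cov}(X, Y) + \text{Var}[Y].
\end{aligned}
\]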
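
Both the covariance shortcut and the variance-of-a-sum identity can be confirmed exactly on a small joint PMF. The table `pmf` and the helper `expect` in the sketch below are hypothetical and serve only as an illustration.

```python
from fractions import Fraction as F

# Hypothetical joint PMF stored as {(x, y): probability}.
pmf = {(0, 0): F(3, 8), (0, 1): F(1, 8), (1, 0): F(1, 8), (1, 1): F(3, 8)}

def expect(g):
    """E[g(X, Y)] = sum over (x, y) of g(x, y) * p_{X,Y}(x, y)."""
    return sum(g(x, y) * p for (x, y), p in pmf.items())

mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov = expect(lambda x, y: (x - mu_x) * (y - mu_y))
var_sum = expect(lambda x, y: (x + y - mu_x - mu_y) ** 2)

print(var_sum, var_x + 2 * cov + var_y)               # both sides of Var[X + Y]: 3/4
print(cov, expect(lambda x, y: x * y) - mu_x * mu_y)  # both sides of Cov(X, Y): 1/8
```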


With covariance defined, we can now define the correlation coefficient $\rho$, which is the
cosine angle of the centralized variables. That is,

\[
\rho = \cos\theta = \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{\mathbb{E}[(X - \mu_X)^2]\, \mathbb{E}[(Y - \mu_Y)^2]}}.
\]

Recognizing that the denominator of this expression is the square root of the product of the
variances of X and Y, we define the correlation coefficient as follows.


Definition 5.13. Let X and Y be two random variables. The correlation coefficient
is

\[
\rho = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}[X]\, \text{Var}[Y]}}. \tag{5.19}
\]

Since $-1 \le \cos\theta \le 1$, $\rho$ is also between −1 and 1. The difference between $\rho$ and $\mathbb{E}[XY]$
is that $\rho$ is normalized with respect to the variance of X and Y, whereas $\mathbb{E}[XY]$ is not
normalized. The correlation coefficient has the following properties:

• $\rho$ is always between −1 and 1, i.e., $-1 \le \rho \le 1$. This is due to the cosine angle
definition.

• When X = Y (fully correlated), $\rho = +1$.
• When X = −Y (negatively correlated), $\rho = -1$.
• When X and Y are uncorrelated, $\rho = 0$.


5.2.3 Independence and correlation


If two random variables X and Y are independent, the joint expectation can be written as
a product of two individual expectations.


Theorem 5.5. If X and Y are independent, then

\[
\mathbb{E}[XY] = \mathbb{E}[X]\, \mathbb{E}[Y].
\]

Proof. We only prove the discrete case because the continuous case can be proved similarly. If
X and Y are independent, we have $p_{X,Y}(x,y) = p_X(x)\, p_Y(y)$. Therefore,

\[
\begin{aligned}
\mathbb{E}[XY] &= \sum_{y} \sum_{x} xy\, p_{X,Y}(x,y) = \sum_{y} \sum_{x} xy\, p_X(x)\, p_Y(y) \\
&= \left( \sum_{x} x\, p_X(x) \right) \left( \sum_{y} y\, p_Y(y) \right) = \mathbb{E}[X]\, \mathbb{E}[Y].
\end{aligned}
\]

In general, for any two independent random variables and two functions $f$ and $g$,

\[
\mathbb{E}[f(X)\, g(Y)] = \mathbb{E}[f(X)]\, \mathbb{E}[g(Y)].
\]

The following theorem illustrates a few important relationships between independence
and correlation.


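Theorem 5.6. Consider the following two statements:
a. X and Y are independent;
b. Cov(X, Y) = 0.

Statement (a) implies statement (b), but (b) does not imply (a). Thus, independence
is a stronger condition than being uncorrelated.


Proof. We first prove that (a) implies (b). If X and Y are independent, then $\mathbb{E}[XY] = \mathbb{E}[X]\, \mathbb{E}[Y]$.
In this case,

\[
\text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\, \mathbb{E}[Y] = \mathbb{E}[X]\, \mathbb{E}[Y] - \mathbb{E}[X]\, \mathbb{E}[Y] = 0.
\]

To prove that (b) does not imply (a), we show a counterexample. Consider a discrete
random variable Z with PMF

\[
p_Z(z) = \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix}, \quad z \in \{0, 1, 2, 3\}.
\]

Let X and Y be

\[
X = \cos\frac{\pi}{2} Z \quad \text{and} \quad Y = \sin\frac{\pi}{2} Z.
\]

Then we can show that $\mathbb{E}[X] = 0$ and $\mathbb{E}[Y] = 0$. The covariance is

\[
\begin{aligned}
\text{Cov}(X, Y) &= \mathbb{E}[(X - 0)(Y - 0)] \\
&= \mathbb{E}\left[ \cos\frac{\pi}{2} Z \, \sin\frac{\pi}{2} Z \right] = \mathbb{E}\left[ \frac{1}{2} \sin \pi Z \right] \\
&= \frac{1}{2} \left[ (\sin \pi 0)\frac{1}{4} + (\sin \pi 1)\frac{1}{4} + (\sin \pi 2)\frac{1}{4} + (\sin \pi 3)\frac{1}{4} \right] = 0.
\end{aligned}
\]

The next step is to show that X and Y are dependent. To this end, we only need to show
that $p_{X,Y}(x,y) \ne p_X(x)\, p_Y(y)$. The joint PMF $p_{X,Y}(x,y)$ can be found by noting that

Z = 0 ⇒ X = 1, Y = 0,
Z = 1 ⇒ X = 0, Y = 1,
Z = 2 ⇒ X = −1, Y = 0,
Z = 3 ⇒ X = 0, Y = −1.

Thus, the PMF is

\[
p_{X,Y}(x,y) = \begin{bmatrix} 0 & \frac{1}{4} & 0 \\ \frac{1}{4} & 0 & \frac{1}{4} \\ 0 & \frac{1}{4} & 0 \end{bmatrix}.
\]

The marginal PMFs are

\[
p_X(x) = \begin{bmatrix} \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \end{bmatrix}, \qquad p_Y(y) = \begin{bmatrix} \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \end{bmatrix}.
\]

The product $p_X(x)\, p_Y(y)$ is

\[
p_X(x)\, p_Y(y) = \begin{bmatrix} \frac{1}{16} & \frac{1}{8} & \frac{1}{16} \\ \frac{1}{8} & \frac{1}{4} & \frac{1}{8} \\ \frac{1}{16} & \frac{1}{8} & \frac{1}{16} \end{bmatrix}.
\]

Therefore, $p_{X,Y}(x,y) \ne p_X(x)\, p_Y(y)$, although $\mathbb{E}[XY] = \mathbb{E}[X]\, \mathbb{E}[Y]$.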


What is the relationship between independent and uncorrelated?

• Independent ⇒ uncorrelated.

• Independent ⇍ uncorrelated.
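
The counterexample built from Z, with X = cos(πZ/2) and Y = sin(πZ/2), can be reproduced in exact arithmetic. In the sketch below the lookup tables `COS` and `SIN` hard-code the four exact values of the cosine and sine, which is an implementation shortcut assumed here to avoid floating-point error.

```python
from fractions import Fraction as F

# Z is uniform on {0, 1, 2, 3}; X = cos(pi Z / 2) and Y = sin(pi Z / 2) take exact integer values.
COS = {0: 1, 1: 0, 2: -1, 3: 0}
SIN = {0: 0, 1: 1, 2: 0, 3: -1}

joint = {}
for z in range(4):
    key = (COS[z], SIN[z])
    joint[key] = joint.get(key, F(0)) + F(1, 4)

values = (-1, 0, 1)
p_x = {x: sum(joint.get((x, y), F(0)) for y in values) for x in values}
p_y = {y: sum(joint.get((x, y), F(0)) for x in values) for y in values}

e_x = sum(x * p for x, p in p_x.items())
e_y = sum(y * p for y, p in p_y.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y

# Independence would require joint(x, y) == p_X(x) p_Y(y) at every pair of states.
factorizes = all(joint.get((x, y), F(0)) == p_x[x] * p_y[y] for x in values for y in values)

print(cov)         # 0: uncorrelated
print(factorizes)  # False: not independent
```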


5.2.4 Computing correlation from data 


We close this section by discussing a very practical problem: Given a dataset containing two
columns of data points, how do we determine whether the two columns are correlated?
Recall that the correlation coefficient is defined as

\[
\rho = \frac{\mathbb{E}[XY] - \mu_X \mu_Y}{\sigma_X \sigma_Y}.
\]

If we have a dataset containing $(x_n, y_n)_{n=1}^{N}$, then the correlation coefficient can be approximated by

\[
\hat{\rho} = \frac{\frac{1}{N} \sum_{n=1}^{N} x_n y_n - \bar{x}\, \bar{y}}{\sqrt{\frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})^2}\; \sqrt{\frac{1}{N} \sum_{n=1}^{N} (y_n - \bar{y})^2}},
\]

where $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$ and $\bar{y} = \frac{1}{N} \sum_{n=1}^{N} y_n$ are the means. This equation should not be a
surprise because essentially all terms are the empirical estimates. Thus, $\hat{\rho}$ is the empirical
correlation coefficient determined from the dataset. As $N \to \infty$, we expect $\hat{\rho} \to \rho$.


[Figure 5.9 panels: three scatter plots with both axes running from −5 to 5. (a) $\hat{\rho} = -0.0038$. (b) $\hat{\rho} = 0.5321$. (c) $\hat{\rho} = 0.9656$.]

Figure 5.9: Visualization of correlated variables. Each of these figures represents a scatter plot of a
dataset containing $(x_n, y_n)_{n=1}^{N}$. (a) is uncorrelated. (b) is somewhat correlated. (c) is strongly correlated.


Figure 5.9 shows three example datasets. We plot the $(x_n, y_n)$ pairs as coordinates in
the 2D plane. The first dataset contains samples that are almost uncorrelated. We can see
that $x_n$ does not tell us anything about $y_n$. The second dataset is moderately correlated.
The third dataset is highly correlated: If we know $x_n$, we are almost certain to know the
corresponding $y_n$, up to a small perturbation.

On a computer, computing the correlation coefficient can be done using built-in commands
such as corrcoef in MATLAB and stats.pearsonr in Python. The code to generate
the results in Figure 5.9(b) is shown below.


% MATLAB code to compute the correlation coefficient
x = mvnrnd([0,0],[3 1; 1 1],1000);   % draw samples from a 2D Gaussian
figure(1); scatter(x(:,1),x(:,2));
rho = corrcoef(x)                    % 2-by-2 matrix; the off-diagonal entry is the coefficient


# Python code to compute the correlation coefficient
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
x = stats.multivariate_normal.rvs([0,0], [[3,1],[1,1]], 10000)  # draw samples from a 2D Gaussian
plt.figure(); plt.scatter(x[:,0],x[:,1])
rho,_ = stats.pearsonr(x[:,0],x[:,1])  # the second output is the p-value
print(rho)
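
To see what such built-in commands compute, the empirical formula for the correlation coefficient can also be written out directly. The sketch below uses only the Python standard library; the helper `empirical_corr`, the synthetic data model (y equal to x plus independent unit-variance noise), the sample size, and the seed are assumptions made for illustration.

```python
import math
import random

def empirical_corr(xs, ys):
    """Empirical correlation coefficient built from 1/N moment estimates."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cross / (sx * sy)

# Synthetic data: y = x + noise, so the true coefficient is 1/sqrt(2), about 0.707.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(5000)]
ys = [x + rng.gauss(0, 1) for x in xs]

rho_hat = empirical_corr(xs, ys)
print(rho_hat)  # close to 1/sqrt(2) for a sample of this size
```

Since the 1/N factors cancel between the numerator and the denominator, this estimate coincides with the usual Pearson sample correlation.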
5.3 Conditional PMF and PDF


Whenever we have a pair of random variables X and Y that are correlated, we can define
their conditional distributions, which quantify the probability of X = x given Y = y. In
this section, we discuss the concepts of conditional PMF and PDF.


5.3.1 Conditional PMF 


We start by defining the conditional PMF for a pair of discrete random variables. 


Definition 5.14. Let X and Y be two discrete random variables. The conditional
PMF of X given Y is

\[
p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}. \tag{5.21}
\]

The simplest way to understand this is to view $p_{X|Y}(x|y)$ as $\mathbb{P}[X = x \,|\, Y = y]$. That is,
given that Y = y, what is the probability for X = x? To see why this perspective makes
sense, let us recall the definition of a conditional probability:

\[
p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} = \frac{\mathbb{P}[X = x \cap Y = y]}{\mathbb{P}[Y = y]} = \mathbb{P}[X = x \,|\, Y = y].
\]

As we can see, the last two equalities are essentially the definitions of conditional probability
and the joint PMF.


How should we understand the notation $p_{X|Y}(x|y)$? Is it a one-variable function in $x$ or
a two-variable function in $(x, y)$? What does $p_{X|Y}(x|y)$ tell us? To answer these questions,
let us first try to understand the randomness exhibited in a conditional PMF. In $p_{X|Y}(x|y)$,
the random variable Y is fixed to a specific value Y = y. Therefore there is nothing random
about Y. All the possibilities of Y have already been taken care of by the denominator
$p_Y(y)$. Only the variable $x$ in $p_{X|Y}(x|y)$ has randomness. What do we mean by "fixed at a
value Y = y"? Consider the following example.


Example 5.16. Suppose there are two fair coins, each taking the value 0 or 1. Let

X = the sum of the values of two coins,
Y = the value of the first coin.

Clearly, X has 3 states: 0, 1, 2, and Y has two states: either 0 or 1. When we say
$p_{X|Y}(x|1)$, we refer to the probability mass function of X when fixing Y = 1. If we do
not impose this condition, the probability mass of X is simple:

\[
p_X(x) = \begin{bmatrix} \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \end{bmatrix}.
\]

However, if we include the conditioning, then

\[
p_{X|Y}(x|1) = \frac{p_{X,Y}(x,1)}{p_Y(1)} = \frac{\begin{bmatrix} 0 & \frac{1}{4} & \frac{1}{4} \end{bmatrix}}{\frac{1}{2}} = \begin{bmatrix} 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}.
\]

To put this in plain words, when Y = 1, there is no way for X to take the state 0. The
chance for X to take the state 1 is 1/2 because the second coin must be 0, i.e., the pair of
coins is (1,0). The chance for X to take the state 2 is 1/2 because the pair has to be (1,1)
in order to give X = 2. Therefore, when we say "conditioned on Y = 1", we mean that we limit our
observations to cases where Y = 1. Since Y is already fixed at Y = 1, there is nothing
random about Y. The only variable is X. This example is illustrated in Figure 5.10.


Figure 5.10: Suppose X is the sum of two coins with PMF 0.25, 0.5, 0.25. Let Y be the first coin.
When X is unconditioned, the PMF is just [0.25, 0.5, 0.25]. When X is conditioned on Y = 1,
then "X = 0" cannot happen. Therefore, the resulting PMF $p_{X|Y}(x|1)$ only has two states. After
normalization we obtain the conditional PMF [0, 0.5, 0.5].


Since Y is already fixed at a particular value Y = y, $p_{X|Y}(x|y)$ is a probability mass
function of $x$ (we want to emphasize again that it is $x$ and not $y$). So $p_{X|Y}(x|y)$ is a
one-variable function in $x$. It is not the same as the usual PMF $p_X(x)$. $p_{X|Y}(x|y)$ is conditioned
on Y = y. For example, $p_{X|Y}(x|1)$ is the PMF of X restricted to the condition that Y = 1.
In fact, it follows that

\[
\sum_{x \in \Omega_X} p_{X|Y}(x|y) = \sum_{x \in \Omega_X} \frac{p_{X,Y}(x,y)}{p_Y(y)} = \frac{\sum_{x \in \Omega_X} p_{X,Y}(x,y)}{p_Y(y)} = \frac{p_Y(y)}{p_Y(y)} = 1.
\]

This tells us that $p_{X|Y}(x|y)$ is a legitimate probability mass function of X. If we sum over the y's
instead, then we will hit a bump:

\[
\sum_{y \in \Omega_Y} p_{X|Y}(x|y) = \sum_{y \in \Omega_Y} \frac{p_{X,Y}(x,y)}{p_Y(y)} \ne 1 \quad \text{in general}.
\]

Therefore, while $p_{X|Y}(x|y)$ is a legitimate probability mass function of X, it is not a
probability mass function of Y.
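
The two summations above can be checked on a small table: summing the conditional PMF over x gives 1, while summing over y generally does not. The joint PMF and the helper `conditional_pmf_x_given_y` in this sketch are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical joint PMF p_{X,Y}(x, y) stored as {(x, y): probability}.
joint = {(1, 1): F(1, 10), (2, 1): F(2, 10), (1, 2): F(3, 10), (2, 2): F(4, 10)}

def conditional_pmf_x_given_y(joint, y):
    """p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y), returned as {x: probability}."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

cond = conditional_pmf_x_given_y(joint, 1)
print(cond)                # PMF of X restricted to Y = 1
print(sum(cond.values()))  # sums to 1 over x

# For a fixed x, summing p_{X|Y}(x | y) over y is generally not 1.
over_y = sum(conditional_pmf_x_given_y(joint, y)[1] for y in (1, 2))
print(over_y)
```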
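Example 5.17. Consider a joint PMF given in the following table. Find the conditional
PMF $p_{X|Y}(x|1)$ and the marginal PMF $p_X(x)$.

The marginal PMF is found by summing the joint PMF over all the states of Y, and the
conditional PMF by dividing the column Y = 1 by $p_Y(1)$:

\[
p_X(x) = \sum_{y=1}^{4} p_{X,Y}(x,y), \qquad p_{X|Y}(x|1) = \frac{p_{X,Y}(x,1)}{p_Y(1)}.
\]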


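Practice Exercise 5.7. Consider two random variables X and Y defined as follows.

\[
Y = \begin{cases} 10^2, & \text{with prob } 5/6, \\ 10^4, & \text{with prob } 1/6, \end{cases}
\qquad
X = \begin{cases} 10^{-4}\, Y, & \text{with prob } 1/2, \\ 10^{-3}\, Y, & \text{with prob } 1/3, \\ 10^{-2}\, Y, & \text{with prob } 1/6. \end{cases}
\]

Find $p_{X|Y}(x|y)$, $p_X(x)$ and $p_{X,Y}(x,y)$.

Solution. Since Y takes two different states, we can enumerate $Y = 10^2$ and $Y = 10^4$.
This gives us

\[
p_{X|Y}(x|10^2) = \begin{cases} \frac{1}{2}, & \text{if } x = 0.01, \\ \frac{1}{3}, & \text{if } x = 0.1, \\ \frac{1}{6}, & \text{if } x = 1, \end{cases}
\qquad
p_{X|Y}(x|10^4) = \begin{cases} \frac{1}{2}, & \text{if } x = 1, \\ \frac{1}{3}, & \text{if } x = 10, \\ \frac{1}{6}, & \text{if } x = 100. \end{cases}
\]

The joint PMF $p_{X,Y}(x,y)$ is

\[
p_{X,Y}(x, 10^2) = p_{X|Y}(x|10^2)\, p_Y(10^2) = \begin{cases} \left(\frac{1}{2}\right)\left(\frac{5}{6}\right), & x = 0.01, \\ \left(\frac{1}{3}\right)\left(\frac{5}{6}\right), & x = 0.1, \\ \left(\frac{1}{6}\right)\left(\frac{5}{6}\right), & x = 1, \end{cases}
\]

\[
p_{X,Y}(x, 10^4) = p_{X|Y}(x|10^4)\, p_Y(10^4) = \begin{cases} \left(\frac{1}{2}\right)\left(\frac{1}{6}\right), & x = 1, \\ \left(\frac{1}{3}\right)\left(\frac{1}{6}\right), & x = 10, \\ \left(\frac{1}{6}\right)\left(\frac{1}{6}\right), & x = 100. \end{cases}
\]

Therefore, the joint PMF is given by the following table.

\[
\begin{array}{c|ccccc}
 & x = 0.01 & x = 0.1 & x = 1 & x = 10 & x = 100 \\ \hline
y = 10^4 & 0 & 0 & \frac{1}{12} & \frac{1}{18} & \frac{1}{36} \\
y = 10^2 & \frac{5}{12} & \frac{5}{18} & \frac{5}{36} & 0 & 0
\end{array}
\]

The marginal PMF $p_X(x)$ is thus

\[
p_X(x) = \sum_{y} p_{X,Y}(x,y) = \begin{bmatrix} \frac{5}{12} & \frac{5}{18} & \frac{2}{9} & \frac{1}{18} & \frac{1}{36} \end{bmatrix} \quad \text{for } x = 0.01,\, 0.1,\, 1,\, 10,\, 100.
\]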


In the previous two examples, what is the probability $\mathbb{P}[X \in A \,|\, Y = y]$ or the probability
$\mathbb{P}[X \in A]$ for some event A? The answers are given by the following theorem.


Theorem 5.7. Let X and Y be two discrete random variables. Let A be an event.

\[
\mathbb{P}[X \in A \,|\, Y = y] = \sum_{x \in A} p_{X|Y}(x|y),
\]

\[
\mathbb{P}[X \in A] = \sum_{x \in A} \sum_{y \in \Omega_Y} p_{X|Y}(x|y)\, p_Y(y) = \sum_{y \in \Omega_Y} \mathbb{P}[X \in A \,|\, Y = y]\, p_Y(y).
\]

Proof. The first statement is based on the fact that if A contains a finite number of elements,
then $\mathbb{P}[X \in A]$ is equivalent to the sum $\sum_{x \in A} \mathbb{P}[X = x]$. Thus,

\[
\begin{aligned}
\mathbb{P}[X \in A \,|\, Y = y] &= \frac{\mathbb{P}[X \in A \cap Y = y]}{\mathbb{P}[Y = y]} \\
&= \frac{\sum_{x \in A} \mathbb{P}[X = x \cap Y = y]}{\mathbb{P}[Y = y]} \\
&= \sum_{x \in A} p_{X|Y}(x|y).
\end{aligned}
\]

The second statement holds because the inner summation $\sum_{y \in \Omega_Y} p_{X|Y}(x|y)\, p_Y(y)$ is just
the marginal PMF $p_X(x)$. Thus the outer summation yields the probability.


Example 5.18. Let us follow up on Example 5.17. What is the probability
$\mathbb{P}[X > 2 \,|\, Y = 1]$? What is the probability $\mathbb{P}[X > 2]$?


Solution. Since the problem asks about the conditional probability, we know that it
can be computed by using the conditional PMF. This gives us

\[
\mathbb{P}[X > 2 \,|\, Y = 1] = \sum_{x > 2} p_{X|Y}(x|1) = \underbrace{p_{X|Y}(3|1)}_{\frac{1}{3}} + \underbrace{p_{X|Y}(4|1)}_{0} = \frac{1}{3}.
\]

The other probability is

\[
\mathbb{P}[X > 2] = \sum_{x > 2} p_X(x) = p_X(3) + p_X(4).
\]

What is the rule of thumb for conditional distribution?

• The PMF/PDF should match with the probability you are finding.

• If you want to find the conditional probability $\mathbb{P}[X \in A \,|\, Y = y]$, use the conditional
PMF $p_{X|Y}(x|y)$.

• If you want to find the probability $\mathbb{P}[X \in A]$, use the marginal PMF $p_X(x)$.


Finally, we define the conditional CDF for discrete random variables.


Definition 5.15. Let X and Y be discrete random variables. Then the conditional
CDF of X given Y = y is

\[
F_{X|Y}(x|y) = \mathbb{P}[X \le x \,|\, Y = y] = \sum_{x' \le x} p_{X|Y}(x'|y). \tag{5.22}
\]




5.3.2 Conditional PDF 


We now discuss the conditioning of a continuous random variable. 


Definition 5.16. Let X and Y be two continuous random variables. The conditional
PDF of X given Y is

\[
f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}. \tag{5.23}
\]


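Example 5.19. Let X and Y be two continuous random variables with a joint PDF

\[
f_{X,Y}(x,y) = \begin{cases} 2 e^{-x} e^{-y}, & 0 \le y \le x < \infty, \\ 0, & \text{otherwise}. \end{cases}
\]

Find the conditional PDFs $f_{X|Y}(x|y)$ and $f_{Y|X}(y|x)$.

Solution. We first find the marginal PDFs.

\[
f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy = \int_{0}^{x} 2 e^{-x} e^{-y}\, dy = 2 e^{-x} (1 - e^{-x}),
\]

\[
f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx = \int_{y}^{\infty} 2 e^{-x} e^{-y}\, dx = 2 e^{-2y}.
\]

Thus, the conditional PDFs are

\[
f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{2 e^{-x} e^{-y}}{2 e^{-2y}} = e^{-(x-y)}, \quad x \ge y,
\]

\[
f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{2 e^{-x} e^{-y}}{2 e^{-x} (1 - e^{-x})} = \frac{e^{-y}}{1 - e^{-x}}, \quad 0 \le y \le x.
\]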


Where does the conditional PDF come from? We cannot duplicate the argument
we used for the discrete case because the denominator of a conditional PMF becomes
$\mathbb{P}[Y = y] = 0$ when Y is continuous. To answer this question, we first define the conditional
CDF for continuous random variables.


Definition 5.17. Let X and Y be continuous random variables. Then the conditional
CDF of X given Y = y is

\[
F_{X|Y}(x|y) = \frac{\int_{-\infty}^{x} f_{X,Y}(x',y)\, dx'}{f_Y(y)}. \tag{5.24}
\]

Why should the conditional CDF of continuous random variables be defined in this way? One
way to interpret $F_{X|Y}(x|y)$ is as the limiting perspective. We can define the conditional CDF
as

\[
\begin{aligned}
F_{X|Y}(x|y) &= \lim_{h \to 0} \mathbb{P}(X \le x \,|\, y \le Y \le y + h) \\
&= \lim_{h \to 0} \frac{\mathbb{P}(X \le x \cap y \le Y \le y + h)}{\mathbb{P}[y \le Y \le y + h]}.
\end{aligned}
\]

With some calculations, we have that

\[
\begin{aligned}
\lim_{h \to 0} \frac{\mathbb{P}(X \le x \cap y \le Y \le y + h)}{\mathbb{P}[y \le Y \le y + h]}
&= \lim_{h \to 0} \frac{\int_{-\infty}^{x} \int_{y}^{y+h} f_{X,Y}(x',y')\, dy'\, dx'}{\int_{y}^{y+h} f_Y(y')\, dy'} \\
&= \lim_{h \to 0} \frac{\int_{-\infty}^{x} f_{X,Y}(x',y)\, dx' \cdot h}{f_Y(y) \cdot h} \\
&= \frac{\int_{-\infty}^{x} f_{X,Y}(x',y)\, dx'}{f_Y(y)}.
\end{aligned}
\]

The key here is that the small step size $h$ in the numerator and the denominator will
cancel each other out. Now, given the conditional CDF, we can verify the definition of the
conditional PDF. It holds that

\[
\begin{aligned}
f_{X|Y}(x|y) &= \frac{d}{dx} F_{X|Y}(x|y) \\
&= \frac{d}{dx} \left\{ \frac{\int_{-\infty}^{x} f_{X,Y}(x',y)\, dx'}{f_Y(y)} \right\} \overset{(a)}{=} \frac{f_{X,Y}(x,y)}{f_Y(y)},
\end{aligned}
\]

where (a) follows from the fundamental theorem of calculus.


Just like the conditional PMF, we can calculate the probabilities using the conditional
PDFs. In particular, if we evaluate the probability where $X \in A$ given that Y takes a
particular value Y = y, then we can integrate the conditional PDF $f_{X|Y}(x|y)$ with respect
to $x$.


Theorem 5.8. Let X and Y be continuous random variables, and let A be an event.

(i) $\mathbb{P}[X \in A \,|\, Y = y] = \int_{A} f_{X|Y}(x|y)\, dx$,
(ii) $\mathbb{P}[X \in A] = \int_{\Omega_Y} \mathbb{P}[X \in A \,|\, Y = y]\, f_Y(y)\, dy$.


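Example 5.20. Let X be a random bit such that

\[
X = \begin{cases} +1, & \text{with prob } 1/2, \\ -1, & \text{with prob } 1/2. \end{cases}
\]

Suppose that X is transmitted over a noisy channel so that the observed signal is

\[
Y = X + N,
\]

where $N \sim \text{Gaussian}(0, 1)$ is the noise, which is independent of the signal X. Find the
probabilities $\mathbb{P}[X = +1 \,|\, Y > 0]$ and $\mathbb{P}[X = -1 \,|\, Y > 0]$.


Solution. First, we know that

\[
f_{Y|X}(y|+1) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(y-1)^2}{2}} \quad \text{and} \quad f_{Y|X}(y|-1) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(y+1)^2}{2}}.
\]

Therefore, integrating $y$ from 0 to $\infty$ gives us

\[
\mathbb{P}[Y > 0 \,|\, X = +1] = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{(y-1)^2}{2}}\, dy = 1 - \Phi\left( \frac{0 - 1}{1} \right) = 1 - \Phi(-1).
\]

Similarly, we have $\mathbb{P}[Y > 0 \,|\, X = -1] = 1 - \Phi(+1)$. The probability we want to find is
$\mathbb{P}[X = +1 \,|\, Y > 0]$, which can be determined using Bayes' theorem.

\[
\mathbb{P}[X = +1 \,|\, Y > 0] = \frac{\mathbb{P}[Y > 0 \,|\, X = +1]\, \mathbb{P}[X = +1]}{\mathbb{P}[Y > 0]}.
\]

The denominator can be found by using the law of total probability:

\[
\begin{aligned}
\mathbb{P}[Y > 0] &= \mathbb{P}[Y > 0 \,|\, X = +1]\, \mathbb{P}[X = +1] + \mathbb{P}[Y > 0 \,|\, X = -1]\, \mathbb{P}[X = -1] \\
&= 1 - \frac{1}{2} \left( \Phi(+1) + \Phi(-1) \right) \\
&= \frac{1}{2},
\end{aligned}
\]

since $\Phi(+1) + \Phi(-1) = \Phi(+1) + 1 - \Phi(+1) = 1$. Therefore,

\[
\mathbb{P}[X = +1 \,|\, Y > 0] = 1 - \Phi(-1) = 0.8413.
\]

The implication is that if Y > 0, the probability $\mathbb{P}[X = +1 \,|\, Y > 0] = 0.8413$. The
complement of this result gives $\mathbb{P}[X = -1 \,|\, Y > 0] = 1 - 0.8413 = 0.1587$.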
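
The numbers in this example can be reproduced with the standard Gaussian CDF, and cross-checked by simulating the channel. In the sketch below the helper `std_normal_cdf` is built from the error function; the Monte Carlo sample size and the seed are arbitrary assumptions.

```python
import math
import random

def std_normal_cdf(z):
    """Phi(z), the CDF of the standard Gaussian, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Likelihoods P[Y > 0 | X = +1] and P[Y > 0 | X = -1] when Y = X + N, N ~ Gaussian(0, 1).
like_pos = 1 - std_normal_cdf(-1.0)
like_neg = 1 - std_normal_cdf(+1.0)

# Law of total probability, then Bayes' theorem, with equal priors of 1/2.
p_y_pos = 0.5 * like_pos + 0.5 * like_neg
posterior = 0.5 * like_pos / p_y_pos
print(round(posterior, 4))  # 0.8413

# Monte Carlo sanity check: keep only the trials with Y > 0 and count how often X = +1.
rng = random.Random(1)
kept = hits = 0
for _ in range(200_000):
    x = rng.choice((+1, -1))
    y = x + rng.gauss(0, 1)
    if y > 0:
        kept += 1
        hits += (x == +1)
estimate = hits / kept
print(estimate)  # close to the posterior computed above
```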


Practice Exercise 5.8. Find $\mathbb{P}[Y > y]$, where

$$X \sim \text{Uniform}[1, 2], \qquad Y \mid X \sim \text{Exponential}(x).$$

Solution. The tricky part of this problem is the tendency to confuse the two variables
$X$ and $Y$. Once you understand their roles the problem becomes easy. First notice that
$Y \mid X \sim \text{Exponential}(x)$ is a conditional distribution. It says that given $X = x$, the
probability distribution of $Y$ is exponential, with the parameter $x$. Thus, we have that

$$f_{Y|X}(y|x) = x e^{-xy}.$$

Why? Recall that if $Y \sim \text{Exponential}(\lambda)$ then $f_Y(y) = \lambda e^{-\lambda y}$. Now if we replace $\lambda$
with $x$, we have $x e^{-xy}$. So the role of $x$ in this conditional density function is as a
parameter.

Given this property, we can compute the conditional probability:

$$\mathbb{P}[Y > y \mid X = x] = \int_y^{\infty} f_{Y|X}(y'|x)\,dy' = \int_y^{\infty} x e^{-xy'}\,dy' = e^{-xy}.$$

Finally, we can compute the marginal probability:

$$\mathbb{P}[Y > y] = \int_{\Omega_X} \mathbb{P}[Y > y \mid X = x]\, f_X(x)\,dx = \int_1^2 e^{-xy}\,dx = \frac{1}{y}\left(e^{-y} - e^{-2y}\right) = \frac{1}{y}\, e^{-y}\left(1 - e^{-y}\right).$$

We can double-check this result by noting that the problem asks about the probability
$\mathbb{P}[Y > y]$. Thus, the answer must be a function of $y$ but not of $x$.
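A quick simulation can confirm the closed form. This is a sketch under stated assumptions: the test point `y0`, the seed, and the helper name `tail_theory` are hypothetical choices, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed (arbitrary)
n = 1_000_000

# Draw X ~ Uniform[1, 2], then Y | X = x ~ Exponential(x).
X = rng.uniform(1.0, 2.0, size=n)
# NumPy parameterizes the exponential by its scale, which is 1 / rate.
Y = rng.exponential(scale=1.0 / X)

def tail_theory(y):
    """P[Y > y] = (exp(-y) - exp(-2y)) / y, valid for y > 0."""
    return (np.exp(-y) - np.exp(-2.0 * y)) / y

y0 = 0.5                          # hypothetical test point
tail_empirical = np.mean(Y > y0)  # fraction of samples exceeding y0

print(tail_empirical, tail_theory(y0))
```

Note that the empirical tail depends only on `y0`, consistent with the remark that the answer is a function of $y$ alone.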


5.4 Conditional Expectation 


5.4.1 Definition 


When dealing with two dependent random variables, at times we would like to determine 
the expectation of a random variable when the second random variable takes a particular 
state. The conditional expectation is a formal way of doing so. 


Definition 5.18. The conditional expectation of $X$ given $Y = y$ is

$$\mathbb{E}[X \mid Y = y] = \sum_{x} x\, p_{X|Y}(x|y)$$

for discrete random variables, and

$$\mathbb{E}[X \mid Y = y] = \int_{-\infty}^{\infty} x\, f_{X|Y}(x|y)\,dx$$

for continuous random variables.

There are two points to note here. First, the expectation $\mathbb{E}[X \mid Y = y]$ is taken with respect
to $f_{X|Y}(x|y)$. We assume that the random variable $Y$ is already fixed at the state $Y = y$.
Thus, the only source of randomness is $X$. Secondly, since the expectation $\mathbb{E}[X \mid Y = y]$ has
eliminated the randomness of $X$, the result is a function of $y$.


What is conditional expectation?

• $\mathbb{E}[X \mid Y = y]$ is the expectation using $f_{X|Y}(x|y)$.

• The integration is taken w.r.t. $x$, because $Y = y$ is given and fixed.

5.4.2 The law of total expectation 


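Theorem 5.9. The law of total expectation states that

$$\mathbb{E}[X] = \sum_{y} \mathbb{E}[X \mid Y = y]\, p_Y(y), \quad \text{or} \quad \mathbb{E}[X] = \int_{-\infty}^{\infty} \mathbb{E}[X \mid Y = y]\, f_Y(y)\,dy. \qquad (5.27)$$

Proof. We will prove the discrete case only, as the continuous case can be proved by replacing
summation with integration.

$$\begin{aligned}
\mathbb{E}[X] = \sum_{x} x\, p_X(x) &= \sum_{x} x \left( \sum_{y} p_{X,Y}(x, y) \right) \\
&= \sum_{x} \sum_{y} x\, p_{X|Y}(x|y)\, p_Y(y) \\
&= \sum_{y} \left( \sum_{x} x\, p_{X|Y}(x|y) \right) p_Y(y) = \sum_{y} \mathbb{E}[X \mid Y = y]\, p_Y(y).
\end{aligned}$$

Figure 5.11 illustrates the idea behind the proof. Essentially, we decompose the
expectation $\mathbb{E}[X]$ into "subexpectations" $\mathbb{E}[X \mid Y = y]$. The probability of each subexpectation is
$p_Y(y)$. By summing the subexpectations multiplied by $p_Y(y)$, we obtain the overall
expectation.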


What is the law of total expectation?

• The law of total expectation is a decomposition rule.

• It decomposes $\mathbb{E}[X]$ into smaller/easier conditional expectations.


This law can also be written in a more compact form. 


Corollary 5.1. Let $X$ and $Y$ be two random variables. Then

$$\mathbb{E}[X] = \mathbb{E}_Y\big[\, \mathbb{E}_{X|Y}[X \mid Y] \,\big].$$


Figure 5.11: The expectation $\mathbb{E}[X]$ can be decomposed into a set of subexpectations. This gives us
$\mathbb{E}[X] = \sum_{y} \mathbb{E}[X \mid Y = y]\, p_Y(y)$.


Proof. The previous theorem states that $\mathbb{E}[X] = \sum_{y} \mathbb{E}[X \mid Y = y]\, p_Y(y)$. If we treat
$\mathbb{E}[X \mid Y = y]$ as a function of $y$, for instance $h(y)$, then

$$\mathbb{E}[X] = \sum_{y} \mathbb{E}[X \mid Y = y]\, p_Y(y) = \sum_{y} h(y)\, p_Y(y) = \mathbb{E}[h(Y)] = \mathbb{E}\big[\mathbb{E}[X \mid Y]\big].$$
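The decomposition can be checked on a small discrete example. The sketch below uses a made-up joint PMF (the table `p_xy` and the states `x_vals` are hypothetical, chosen only so that the entries sum to one) and compares the two ways of computing $\mathbb{E}[X]$.

```python
import numpy as np

# Hypothetical joint PMF p_{X,Y}(x, y): rows index x, columns index y.
x_vals = np.array([0.0, 1.0, 2.0])
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.10],
                 [0.05, 0.25]])

p_y = p_xy.sum(axis=0)            # marginal p_Y(y), one entry per column
p_x_given_y = p_xy / p_y          # column j holds p_{X|Y}(x | y_j)
cond_exp = x_vals @ p_x_given_y   # E[X | Y = y_j] for each j

# Law of total expectation: sum_y E[X | Y = y] p_Y(y).
total = np.sum(cond_exp * p_y)

# Direct computation: sum_x x p_X(x).
direct = np.sum(x_vals * p_xy.sum(axis=1))

print(cond_exp, total, direct)
```

Here `cond_exp` plays the role of the function $h(y)$ in the proof of Corollary 5.1, and `total` is $\mathbb{E}[h(Y)]$.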


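Example 5.21. Suppose there are two classes of cars. Let $X$ be the speed of a car
and $C$ be the class. When $C = 1$, we know that $X \sim \text{Gaussian}(\mu_1, \sigma_1)$. We know that
$\mathbb{P}[C = 1] = p$. When $C = 2$, $X \sim \text{Gaussian}(\mu_2, \sigma_2)$. Also, $\mathbb{P}[C = 2] = 1 - p$. If you see
a car on the freeway, what is its average speed?


Solution. The problem has given us everything we need. In particular, we know that
the conditional PDFs are:

$$f_{X|C}(x \mid 1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left\{ -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right\}, \qquad
f_{X|C}(x \mid 2) = \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left\{ -\frac{(x - \mu_2)^2}{2\sigma_2^2} \right\}.$$

Therefore, conditioned on $C$, we have two expectations:

$$\mathbb{E}[X \mid C = 1] = \int_{-\infty}^{\infty} x\, f_{X|C}(x \mid 1)\,dx = \mu_1, \qquad
\mathbb{E}[X \mid C = 2] = \int_{-\infty}^{\infty} x\, f_{X|C}(x \mid 2)\,dx = \mu_2.$$

The overall expectation $\mathbb{E}[X]$ is

$$\mathbb{E}[X] = \mathbb{E}[X \mid C = 1]\,\mathbb{P}[C = 1] + \mathbb{E}[X \mid C = 2]\,\mathbb{P}[C = 2] = p\mu_1 + (1 - p)\mu_2.$$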
Practice Exercise 5.9. Consider a joint PMF given by the following table. Find
$\mathbb{E}[X \mid Y = 10^2]$ and $\mathbb{E}[X \mid Y = 10^4]$.


Solution. To find the conditional expectation, we first need to know the conditional 
PMF. 


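$$p_{X|Y}(x \mid 10^2) = [\,\ldots\,], \qquad p_{X|Y}(x \mid 10^4) = [\,\ldots\,].$$

Therefore, the conditional expectations are

$$\mathbb{E}[X \mid Y = 10^2] = \ldots, \qquad \mathbb{E}[X \mid Y = 10^4] = \ldots$$

From the conditional expectations we can also find $\mathbb{E}[X]$:

$$\mathbb{E}[X] = \mathbb{E}[X \mid Y = 10^2]\, p_Y(10^2) + \mathbb{E}[X \mid Y = 10^4]\, p_Y(10^4) = \ldots = 3.5875.$$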


Example 5.22. Consider two random variables $X$ and $Y$. The random variable $X$
is Gaussian-distributed with $X \sim \text{Gaussian}(\mu, \sigma^2)$. The random variable $Y$ has a
conditional distribution $Y \mid X \sim \text{Gaussian}(X, X^2)$. Find $\mathbb{E}[Y]$.


Solution. The notation $Y \mid X \sim \text{Gaussian}(X, X^2)$ means that given the variable $X$,
the other variable $Y$ has a conditional distribution $\text{Gaussian}(X, X^2)$. That is, the
variable $Y$ is a Gaussian with mean $X$ and variance $X^2$. How can the mean be a
random variable $X$ and the variance be another random variable $X^2$? Because $X$ is
the conditional variable. $Y \mid X$ means that you have already chosen one state of $X$.
Given that particular state, the distribution of $Y$ follows $f_{Y|X}$. Therefore, for this
problem, we know the PDFs:

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}, \qquad
f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi x^2}} \exp\left\{ -\frac{(y - x)^2}{2x^2} \right\}.$$

The conditional expectation of $Y$ given $X$ is

$$\mathbb{E}[Y \mid X = x] = \int_{-\infty}^{\infty} y\, \frac{1}{\sqrt{2\pi x^2}} \exp\left\{ -\frac{(y - x)^2}{2x^2} \right\} dy = \mathbb{E}[\text{Gaussian}(x, x^2)] = x.$$

The last equality holds because we are computing the expectation of a Gaussian random
variable with mean $x$. Finally, applying the law of total expectation, we can show
that

$$\mathbb{E}[Y] = \int_{-\infty}^{\infty} \mathbb{E}[Y \mid X = x]\, f_X(x)\,dx = \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} dx = \mathbb{E}[\text{Gaussian}(\mu, \sigma^2)] = \mu,$$

where the last equality is based on the fact that it is the mean of a Gaussian.
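The two-stage structure of this example maps directly onto a two-stage simulation: draw $X$ first, then draw $Y$ given that $X$. The sketch below assumes illustrative values $\mu = 3$ and $\sigma = 0.5$ (these numbers are not from the example) and checks that the sample mean of $Y$ is close to $\mu$.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed (arbitrary)
mu, sigma = 3.0, 0.5             # hypothetical parameters of X
n = 1_000_000

# Stage 1: X ~ Gaussian(mu, sigma^2).
X = rng.normal(mu, sigma, size=n)

# Stage 2: given X = x, Y ~ Gaussian(x, x^2), so the standard deviation is |x|.
Y = rng.normal(loc=X, scale=np.abs(X))

mean_Y = Y.mean()
print(mean_Y, mu)
```

The absolute value is needed because `scale` is a standard deviation, and $X$ can be negative even though the variance $X^2$ never is.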


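Practice Exercise 5.10. Find $\mathbb{E}[\sin(X + Y)]$, if $X \sim \text{Gaussian}(0, 1)$, and $Y \mid X \sim
\text{Uniform}[x - \pi, x + \pi]$.


Solution. We know that the conditional density is

$$f_{Y|X}(y|x) = \frac{1}{2\pi}, \qquad x - \pi \le y \le x + \pi.$$

Therefore, we can compute the conditional expectation

$$\mathbb{E}[\sin(X + Y) \mid X = x] = \int_{x - \pi}^{x + \pi} \sin(x + y)\, f_{Y|X}(y|x)\,dy = \frac{1}{2\pi} \int_{x - \pi}^{x + \pi} \sin(x + y)\,dy = 0,$$

because the integral covers exactly one full period of the sinusoid. Hence, the overall
expectation is

$$\mathbb{E}[\sin(X + Y)] = \int_{-\infty}^{\infty} \mathbb{E}[\sin(X + Y) \mid X = x]\, f_X(x)\,dx = 0.$$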
5.5 Sum of Two Random Variables


One typical problem we encounter in engineering is to determine the PDF of the sum of 
two random variables X and Y, i.e., X + Y. Such a problem arises naturally when we want 
to evaluate the average of many random variables, e.g., the sample mean of a collection of 
data points. This section will discuss a general principle for determining the PDF of a sum 
of two random variables. 


5.5.1 Intuition through convolution 


First, consider two random variables, X and Y, both discrete uniform random variables 
in the range of 0,1,2,3. That is, px(x) = py(y) = [1/4,1/4, 1/4, 1/4]. Since this is such a 
simple problem we can enumerate all the possible cases of the sum Z = X+Y. The resulting 
probabilities are shown in the following table. 


Z = X + Y    Cases, written in terms of (X, Y)    Probability

0            (0,0)                                1/16
1            (0,1), (1,0)                         2/16
2            (1,1), (2,0), (0,2)                  3/16
3            (3,0), (2,1), (1,2), (0,3)           4/16
4            (3,1), (2,2), (1,3)                  3/16
5            (3,2), (2,3)                         2/16
6            (3,3)                                1/16


Clearly, the PMF of $Z$ is not $p_Z(z) = p_X(x) + p_Y(y)$. (Caution! Do not write this.) The
PMF of $Z$ looks like a triangle distribution. How can we get to this triangle distribution
from two uniform distributions? The key is the idea of convolution. Let us start with the
PMF of $X$, which is $p_X(x)$. Let us also flip $p_Y(y)$ over the $y$-axis. As we shift the flipped $p_Y$,
we multiply and add the PMF values as shown in Figure 5.12. This gives us

$$\begin{aligned}
p_Z(0) &= \mathbb{P}[X + Y = 0] \\
&= \mathbb{P}[(X, Y) = (0, 0)] \\
&= p_X(0)\, p_Y(0) \\
&= \frac{1}{16}.
\end{aligned}$$

Now, if we shift towards the right by 1, we have

$$\begin{aligned}
p_Z(1) &= \mathbb{P}[X + Y = 1] \\
&= \mathbb{P}[(X, Y) = (0, 1) \cup (X, Y) = (1, 0)] \\
&= p_X(0)\, p_Y(1) + p_X(1)\, p_Y(0) = \frac{2}{16}.
\end{aligned}$$

By continuing our argument, you can see that we will obtain the same PMF as the one
shown in the table.
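The flip-shift-multiply-add procedure is exactly what a discrete convolution routine computes. As a minimal sketch, NumPy's `np.convolve` reproduces the table for the two uniform PMFs on {0, 1, 2, 3} (the variable names are chosen here for illustration):

```python
import numpy as np

# PMFs of X and Y on the states 0, 1, 2, 3.
p_x = np.full(4, 0.25)
p_y = np.full(4, 0.25)

# Entry k of the convolution is sum_l p_x[l] * p_y[k - l] = P[X + Y = k].
p_z = np.convolve(p_x, p_y)

print(p_z * 16)   # → [1. 2. 3. 4. 3. 2. 1.]
```

The output has 7 entries, one for each possible sum from 0 to 6, and shows the triangle shape directly.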


Figure 5.12: When summing two random variables X and Y, we are effectively taking the convolution
of the two respective PMFs / PDFs. The panel shown illustrates the case X + Y = 1: (0,1) or (1,0).


5.5.2 Main result


We can show that for any two independent random variables $X$ and $Y$, the sum $Z = X + Y$ has a
distribution that is the convolution of the two individual PDFs.


Theorem 5.10. Let $X$ and $Y$ be two independent random variables with PDFs $f_X(x)$
and $f_Y(y)$ respectively. Let $Z = X + Y$. The PDF of $Z$ is given by

$$f_Z(z) = (f_X * f_Y)(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\,dy, \qquad (5.29)$$

where "$*$" denotes the convolution.


Proof. We begin by analyzing the CDF of $Z$. The CDF of $Z$ is

$$F_Z(z) = \mathbb{P}[Z \le z] = \mathbb{P}[X + Y \le z].$$

We now draw a picture to illustrate the line under which we want to integrate. As shown in
Figure 5.13, the boundary of the set $X + Y \le z$ is a straight line in the $xy$ plane. You can think
of it as $Y \le -X + z$, so that the slope is $-1$ and the $y$-intercept is $z$.

Now, shall we take the region above the line or the region below it? Since the inequality
is $Y \le -X + z$, a value of $Y$ has to be less than that of the line. Another easy way to check
is to assume $z > 0$ so that we have a positive $y$-intercept. Then we check where the origin
$(0, 0)$ belongs. In this case, if $z > 0$, the origin $(0, 0)$ will satisfy the inequality $Y \le -X + z$,
and so it must be included. Thus, we conclude that the area is below the line.


Figure 5.13: The shaded region highlights the set $X + Y \le z$. To integrate the PDF over this region,
we first take the inner integration over $dx$ and then take the outer integration over $dy$.


Once we have determined the area to be integrated, we can write down the integration:

$$\begin{aligned}
\mathbb{P}[X + Y \le z] &= \int_{-\infty}^{\infty} \int_{-\infty}^{z - y} f_{X,Y}(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{z - y} f_X(x)\, f_Y(y)\,dx\,dy, \qquad \text{(independence)}
\end{aligned}$$

where the integration limits are just a rewrite of $X + Y \le z$ (in this case since we are
integrating $x$ first we have $X \le -Y + z$). Then, by the fundamental theorem of calculus,
we can show that

$$\begin{aligned}
f_Z(z) = \frac{d}{dz} F_Z(z) &= \frac{d}{dz} \int_{-\infty}^{\infty} \int_{-\infty}^{z - y} f_X(x)\, f_Y(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \left( \frac{d}{dz} \int_{-\infty}^{z - y} f_X(x)\,dx \right) f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\,dy = (f_X * f_Y)(z),
\end{aligned}$$

where "$*$" denotes the convolution.


How is convolution related to random variables?

• If you sum two independent random variables $X$ and $Y$, the resulting PDF is the convolution of $f_X$ and $f_Y$.

• E.g., convolving the PDFs of two uniform random variables gives you a triangle PDF.


5.5.3 Sum of common distributions


Theorem 5.11 (Sum of two Poissons). Let $X_1 \sim \text{Poisson}(\lambda_1)$ and $X_2 \sim \text{Poisson}(\lambda_2)$
be independent. Then

$$X_1 + X_2 \sim \text{Poisson}(\lambda_1 + \lambda_2). \qquad (5.30)$$




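Proof. Let us apply the convolution principle. Writing $Z = X_1 + X_2$, we have

$$\begin{aligned}
p_Z(k) &= \mathbb{P}[X_1 + X_2 = k] \\
&= \sum_{\ell = 0}^{k} \mathbb{P}[X_1 = \ell \,\cap\, X_2 = k - \ell] \\
&= \sum_{\ell = 0}^{k} \frac{\lambda_1^{\ell} e^{-\lambda_1}}{\ell!} \cdot \frac{\lambda_2^{k - \ell} e^{-\lambda_2}}{(k - \ell)!} \\
&= e^{-(\lambda_1 + \lambda_2)} \cdot \frac{1}{k!} \sum_{\ell = 0}^{k} \frac{k!}{\ell!\,(k - \ell)!}\, \lambda_1^{\ell} \lambda_2^{k - \ell} \\
&= e^{-(\lambda_1 + \lambda_2)} \cdot \frac{1}{k!} \sum_{\ell = 0}^{k} \binom{k}{\ell} \lambda_1^{\ell} \lambda_2^{k - \ell} \\
&= \frac{(\lambda_1 + \lambda_2)^k}{k!}\, e^{-(\lambda_1 + \lambda_2)},
\end{aligned}$$

where the last step is based on the binomial identity $\sum_{\ell = 0}^{k} \binom{k}{\ell} a^{\ell} b^{k - \ell} = (a + b)^k$.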
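The identity can also be verified numerically by convolving two Poisson PMFs. In the sketch below the rates 1.5 and 2.5 and the truncation at 60 states are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import poisson

lam1, lam2 = 1.5, 2.5          # hypothetical rates
k = np.arange(0, 60)           # states 0, 1, ..., 59

p1 = poisson.pmf(k, lam1)
p2 = poisson.pmf(k, lam2)

# For index k < 60, the convolution sums p1[l] * p2[k - l] over l = 0..k,
# so the first 60 entries are not affected by the truncation.
p_sum = np.convolve(p1, p2)[:k.size]

p_theory = poisson.pmf(k, lam1 + lam2)
max_err = np.max(np.abs(p_sum - p_theory))

print(max_err)
```

The maximum discrepancy should be at the level of floating-point round-off.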


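Theorem 5.12 (Sum of two Gaussians). Let $X_1$ and $X_2$ be two independent Gaussian random
variables such that

$$X_1 \sim \text{Gaussian}(\mu_1, \sigma_1^2) \quad \text{and} \quad X_2 \sim \text{Gaussian}(\mu_2, \sigma_2^2).$$

Then

$$X_1 + X_2 \sim \text{Gaussian}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2).$$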


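Proof. Let us apply the convolution principle. For simplicity, we show the case where the two
variances are equal, $\sigma_1 = \sigma_2 = \sigma$. Writing $Z = X_1 + X_2$, we have

$$\begin{aligned}
f_Z(z) &= \int_{-\infty}^{\infty} f_{X_1}(t)\, f_{X_2}(z - t)\,dt \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(t - \mu_1)^2}{2\sigma^2} \right\} \cdot \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(z - t - \mu_2)^2}{2\sigma^2} \right\} dt \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(t - \mu_1)^2 + (z - t - \mu_2)^2}{2\sigma^2} \right\} dt.
\end{aligned}$$

We now complete the square:

$$\begin{aligned}
(t - \mu_1)^2 + (z - t - \mu_2)^2 &= [t^2 - 2\mu_1 t + \mu_1^2] + [t^2 + 2t(\mu_2 - z) + (\mu_2 - z)^2] \\
&= 2t^2 - 2t(\mu_1 - \mu_2 + z) + \mu_1^2 + (\mu_2 - z)^2 \\
&= 2\left[ t^2 - 2t \cdot \frac{\mu_1 - \mu_2 + z}{2} \right] + \mu_1^2 + (\mu_2 - z)^2 \\
&= 2\left[ t - \frac{\mu_1 - \mu_2 + z}{2} \right]^2 - 2\left( \frac{\mu_1 - \mu_2 + z}{2} \right)^2 + \mu_1^2 + (\mu_2 - z)^2.
\end{aligned}$$

The last term can be simplified to

$$\begin{aligned}
-2\left( \frac{\mu_1 - \mu_2 + z}{2} \right)^2 + \mu_1^2 + (\mu_2 - z)^2
&= -\frac{\mu_1^2 - 2\mu_1(\mu_2 - z) + (\mu_2 - z)^2}{2} + \mu_1^2 + (\mu_2 - z)^2 \\
&= \frac{\mu_1^2 + 2\mu_1(\mu_2 - z) + (\mu_2 - z)^2}{2} = \frac{(\mu_1 + \mu_2 - z)^2}{2}.
\end{aligned}$$

Substituting these into the integral, we can show that

$$\begin{aligned}
f_Z(z) &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{2\left( t - \frac{\mu_1 - \mu_2 + z}{2} \right)^2 + \frac{(\mu_1 + \mu_2 - z)^2}{2}}{2\sigma^2} \right\} dt \\
&= \frac{1}{\sqrt{4\pi\sigma^2}} \exp\left\{ -\frac{(\mu_1 + \mu_2 - z)^2}{4\sigma^2} \right\} \int_{-\infty}^{\infty} \frac{1}{\sqrt{\pi\sigma^2}} \exp\left\{ -\frac{\left( t - \frac{\mu_1 - \mu_2 + z}{2} \right)^2}{\sigma^2} \right\} dt \\
&= \frac{1}{\sqrt{2\pi(2\sigma^2)}} \exp\left\{ -\frac{(\mu_1 + \mu_2 - z)^2}{2(2\sigma^2)} \right\},
\end{aligned}$$

where the remaining integral equals 1 because its integrand is a Gaussian PDF with variance $\sigma^2/2$.
Therefore, we have shown that the resulting distribution is a Gaussian with mean $\mu_1 + \mu_2$
and variance $2\sigma^2$.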
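The theorem also covers unequal variances. A numerical convolution gives a quick check of that general case. The sketch below uses hypothetical parameter values and SciPy's `quad` integrator; the function names `f_Z_numeric` and `f_Z_closed` are introduced here for illustration only.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Hypothetical parameters with different variances.
mu1, sigma1 = 1.0, 2.0
mu2, sigma2 = -3.0, 1.5

def f_Z_numeric(z):
    """Convolution integral of the two Gaussian PDFs, evaluated numerically."""
    integrand = lambda t: norm.pdf(t, mu1, sigma1) * norm.pdf(z - t, mu2, sigma2)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

def f_Z_closed(z):
    """Gaussian PDF with mean mu1 + mu2 and variance sigma1^2 + sigma2^2."""
    return norm.pdf(z, mu1 + mu2, np.sqrt(sigma1**2 + sigma2**2))

for z in [-6.0, -2.0, 0.0, 3.0]:
    print(z, f_Z_numeric(z), f_Z_closed(z))
```

At each test point the two columns should agree up to the integrator's tolerance.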


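Practice Exercise 5.11. Let $X$ and $Y$ be independent, and let

$$f_X(x) = \begin{cases} x e^{-x}, & x \ge 0, \\ 0, & x < 0, \end{cases}
\quad \text{and} \quad
f_Y(y) = \begin{cases} y e^{-y}, & y \ge 0, \\ 0, & y < 0. \end{cases}$$

Find the PDF of $Z = X + Y$.


Solution. Using the results derived above, we see that

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\,dy = \int_{0}^{z} f_X(z - y)\, f_Y(y)\,dy,$$

where the lower limit 0 came from the fact that $y \ge 0$, and the upper limit $z$ came from the fact
that $x \ge 0$. Since $Z = X + Y$, we must have $Z - Y = X \ge 0$ and so $Z \ge Y$. This is portrayed
graphically in Figure 5.14. Substituting the PDFs into the integration yields, for $z \ge 0$,

$$f_Z(z) = \int_0^z (z - y)\, e^{-(z - y)}\, y\, e^{-y}\,dy = \frac{z^3}{6}\, e^{-z}.$$

For $z < 0$, $f_Z(z) = 0$.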
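The finite integration limits are the part of this exercise that is easiest to get wrong, so a numerical check is useful. The sketch below evaluates the convolution integral on $[0, z]$ with SciPy's `quad` and compares it with $z^3 e^{-z}/6$; the helper names are hypothetical.

```python
import numpy as np
from scipy.integrate import quad

def f(u):
    """Common PDF of X and Y: u * exp(-u) for u >= 0, and 0 otherwise."""
    return u * np.exp(-u) if u >= 0 else 0.0

def f_Z_numeric(z):
    """Convolution integral; the integrand vanishes outside 0 <= y <= z."""
    if z <= 0:
        return 0.0
    value, _ = quad(lambda y: f(z - y) * f(y), 0.0, z)
    return value

def f_Z_closed(z):
    """Closed form from the exercise."""
    return z**3 / 6.0 * np.exp(-z) if z >= 0 else 0.0

for z in [0.5, 1.0, 3.0, 7.0]:
    print(z, f_Z_numeric(z), f_Z_closed(z))
```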


The functions of two random variables are not limited to summation. The following 
example illustrates the case of the product of two random variables. 


Figure 5.14: [Left] The outer integral goes from 0 to z because the triangle stops at y = z. [Right] If
the triangle is unbounded, then the integral goes from −∞ to ∞.

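Example 5.23. Let $X$ and $Y$ be two independent random variables such that

$$f_X(x) = \begin{cases} 2x, & \text{if } 0 \le x \le 1, \\ 0, & \text{otherwise}, \end{cases}
\quad \text{and} \quad
f_Y(y) = \begin{cases} 1, & \text{if } 0 \le y \le 1, \\ 0, & \text{otherwise}. \end{cases}$$

Let $Z = XY$. Find $f_Z(z)$.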
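Solution. The CDF of $Z$ can be evaluated as

$$F_Z(z) = \mathbb{P}[Z \le z] = \mathbb{P}[XY \le z] = \int_{-\infty}^{\infty} \int_{-\infty}^{z/y} f_X(x)\, f_Y(y)\,dx\,dy,$$

where we used the fact that $Y$ takes positive values, so that $XY \le z$ is equivalent to $X \le z/Y$.
Taking the derivative yields

$$\begin{aligned}
f_Z(z) = \frac{d}{dz} F_Z(z) &= \frac{d}{dz} \int_{-\infty}^{\infty} \int_{-\infty}^{z/y} f_X(x)\, f_Y(y)\,dx\,dy \\
&\overset{(a)}{=} \int_{-\infty}^{\infty} \frac{1}{y}\, f_X\!\left( \frac{z}{y} \right) f_Y(y)\,dy,
\end{aligned}$$

where (a) holds by the fundamental theorem of calculus. The upper and lower limits of
this integration can be determined by noting that

$$0 \le x = \frac{z}{y} \le 1,$$

which implies that $z \le y$. Since $y \le 1$, we have that $z \le y \le 1$. Therefore, the PDF is

$$f_Z(z) = \int_z^1 \frac{1}{y}\, f_X\!\left( \frac{z}{y} \right) f_Y(y)\,dy.$$

For $z < 0$, $f_Z(z) = 0$.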
Closing remark. For some random variables, summing two independent copies yields the same
type of random variable (but with different parameters). For other random variables, summing
two independent copies gives a different type of random variable. Table 5.1 summarizes some of the most
commonly used random variable pairs.


X1                    X2                    Sum X1 + X2

Bernoulli(p)          Bernoulli(p)          Binomial(2, p)
Binomial(n, p)        Binomial(m, p)        Binomial(m + n, p)
Poisson(λ1)           Poisson(λ2)           Poisson(λ1 + λ2)
Exponential(λ)        Exponential(λ)        Erlang(2, λ)
Gaussian(μ1, σ1²)     Gaussian(μ2, σ2²)     Gaussian(μ1 + μ2, σ1² + σ2²)


Table 5.1: Common distributions of the sum of two independent random variables.


5.6 Random Vectors and Covariance Matrices 


We now enter the second part of this chapter. In the first part, we were mainly interested 
in a pair of random variables. In the second part, however, we will study vectors of N 
random variables. To understand a vector of random variables, we will not drill down to 
the integrations of the PDFs (which you would certainly not enjoy). Instead, we will blend 
linear algebra tools and probabilistic tools to learn a few practical data analysis techniques. 


5.6.1 PDF of random vectors 


Joint distributions can be generalized to more than two random variables. The most conve- 
nient way is to consider a vector of random variables and their corresponding states. 


$$\boldsymbol{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_N \end{bmatrix}, \quad \text{and} \quad \boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}.$$


Our notation here is unconventional since bold upper case letters usually represent matrices.
Here, $\boldsymbol{X}$ denotes a vector, specifically a random vector. Its state is a vector $\boldsymbol{x}$. In this chapter,
we will use the following notational convention: $\boldsymbol{X}$ and $\boldsymbol{Y}$ represent random vectors while
$\boldsymbol{A}$ represents a matrix.

One way to think about $\boldsymbol{X}$ is to imagine that if you put your hand into the sample
space, you will pick up a vector $\boldsymbol{x}$. This random realization $\boldsymbol{x}$ has $N$ entries, and so you
need to specify the probability of getting all these entries simultaneously. Accordingly, we
should expect that $\boldsymbol{X}$ is characterized by an $N$-dimensional PDF

$$f_{\boldsymbol{X}}(\boldsymbol{x}) = f_{X_1, X_2, \ldots, X_N}(x_1, x_2, \ldots, x_N).$$


Essentially, this PDF tells us the probability density for random variable $X_1 = x_1$, random
variable $X_2 = x_2$, etc. It is a coordinate-wise description. For example, if $\boldsymbol{X}$ contains three
elements such that $\boldsymbol{X} = [X_1, X_2, X_3]^T$, and if the state we are looking at is $\boldsymbol{x} = [3, 1, 7]^T$,
then $f_{\boldsymbol{X}}(\boldsymbol{x})$ is the probability density such that this 3D coordinate $(X_1, X_2, X_3)$ takes the
value $[3, 1, 7]^T$.

To compute the probability, we can integrate $f_{\boldsymbol{X}}(\boldsymbol{x})$ with respect to $\boldsymbol{x}$. Let $A$ be the
event. Then

$$\mathbb{P}[\boldsymbol{X} \in A] = \int_A f_{\boldsymbol{X}}(\boldsymbol{x})\,d\boldsymbol{x} = \int \cdots \int_A f_{X_1, \ldots, X_N}(x_1, \ldots, x_N)\,dx_1 \cdots dx_N.$$

If the random coordinates $X_1, \ldots, X_N$ are independent, the PDF can be written as a product
of $N$ individual PDFs:

$$f_{X_1, \ldots, X_N}(x_1, \ldots, x_N) = f_{X_1}(x_1)\, f_{X_2}(x_2) \cdots f_{X_N}(x_N),$$

and so

$$\mathbb{P}[\boldsymbol{X} \in A] = \int \cdots \int_A f_{X_1}(x_1)\, f_{X_2}(x_2) \cdots f_{X_N}(x_N)\,dx_1 \cdots dx_N.$$

However, this does not necessarily simplify the calculation unless $A$ is separable, e.g., $A =
[a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_N, b_N]$. In this case the integration becomes

$$\mathbb{P}[\boldsymbol{X} \in A] = \prod_{i=1}^{N} \int_{a_i}^{b_i} f_{X_i}(x_i)\,dx_i,$$

which is obviously manageable.


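Example 5.24. Let $\boldsymbol{X} = [X_1, \ldots, X_N]^T$ be a vector of independent zero-mean unit-variance
Gaussian random variables. Let $A = [-1, 2]^N$. Then

$$\begin{aligned}
\mathbb{P}[\boldsymbol{X} \in A] &= \int_A f_{\boldsymbol{X}}(\boldsymbol{x})\,d\boldsymbol{x} \\
&= \int \cdots \int_A f_{X_1, \ldots, X_N}(x_1, \ldots, x_N)\,dx_1 \cdots dx_N \\
&= \prod_{i=1}^{N} \int_{-1}^{2} f_{X_i}(x_i)\,dx_i = \big( \Phi(2) - \Phi(-1) \big)^N,
\end{aligned}$$

where $\Phi(\cdot)$ is the standard Gaussian CDF.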
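For a separable event such as this one, the product formula is cheap to evaluate and easy to confirm by simulation. The sketch below assumes $N = 3$ and an arbitrary sample size; neither value comes from the example.

```python
import numpy as np
from scipy.stats import norm

N = 3   # dimension (hypothetical choice for illustration)

# Product formula: (Phi(2) - Phi(-1))^N.
p_theory = (norm.cdf(2.0) - norm.cdf(-1.0)) ** N

# Monte Carlo estimate: fraction of samples whose every coordinate lies in [-1, 2].
rng = np.random.default_rng(4)
X = rng.standard_normal((500_000, N))   # rows are independent samples
inside = np.all((X >= -1.0) & (X <= 2.0), axis=1)
p_empirical = inside.mean()

print(p_theory, p_empirical)
```

Because the probability is a product of $N$ factors smaller than one, it decays geometrically as the dimension grows.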


As you can see from the definition of a vector random variable, computing the proba- 
bility typically involves integrating a high-dimensional function, which is tedious. However, 
the good news is that in practice we seldom need to perform such calculations. Often we are 
more interested in the mean and the covariance of the random vectors because they usually 
carry geometric meanings. The next subsection explores this topic. 
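5.6.2 Expectation of random vectors


Let $\boldsymbol{X} = [X_1, \ldots, X_N]^T$ be a random vector. We define the expectation of a random vector
as follows.


Definition 5.19. Let $\boldsymbol{X} = [X_1, \ldots, X_N]^T$ be a random vector. The expectation is

$$\boldsymbol{\mu} \overset{\text{def}}{=} \mathbb{E}[\boldsymbol{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_N] \end{bmatrix}. \qquad (5.32)$$


The resulting vector is called the mean vector. Since the mean vector is a vector of
individual elements, we need to compute the marginal PDFs before computing the
expectations:

$$\mathbb{E}[\boldsymbol{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \vdots \\ \mathbb{E}[X_N] \end{bmatrix} = \begin{bmatrix} \int_{\Omega} x_1\, f_{X_1}(x_1)\,dx_1 \\ \vdots \\ \int_{\Omega} x_N\, f_{X_N}(x_N)\,dx_N \end{bmatrix},$$

where the marginal PDF is determined by

$$f_{X_n}(x_n) = \int_{\Omega} f_{\boldsymbol{X}}(\boldsymbol{x})\,d\boldsymbol{x}_{\backslash n}.$$

In the equation above, $\boldsymbol{x}_{\backslash n} = [x_1, \ldots, x_{n-1}, x_{n+1}, \ldots, x_N]^T$ contains all the elements
without $x_n$. For example, if the PDF is $f_{X_1, X_2, X_3}(x_1, x_2, x_3)$, then

$$\mathbb{E}[X_1] = \int x_1 \underbrace{\int\!\!\int f_{X_1, X_2, X_3}(x_1, x_2, x_3)\,dx_2\,dx_3}_{f_{X_1}(x_1)}\,dx_1.$$

Again, this will become tedious when there are many variables.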


While the definition of the expectation may be challenging to understand, some prob- 
lems using it are straightforward. We will first demonstrate the case of independent Poisson 
random variables, and then we will discuss joint Gaussians. 


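Example 5.25. Let $\boldsymbol{X} = [X_1, \ldots, X_N]^T$ be a random vector such that $X_n$ are independent
Poissons with $X_n \sim \text{Poisson}(\lambda_n)$. Then

$$\mathbb{E}[\boldsymbol{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \vdots \\ \mathbb{E}[X_N] \end{bmatrix}
= \begin{bmatrix} \sum_{k=0}^{\infty} k \cdot \frac{\lambda_1^k e^{-\lambda_1}}{k!} \\ \vdots \\ \sum_{k=0}^{\infty} k \cdot \frac{\lambda_N^k e^{-\lambda_N}}{k!} \end{bmatrix}
= \begin{bmatrix} \lambda_1 \\ \vdots \\ \lambda_N \end{bmatrix}.$$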


On computers, computing the mean vector can be done using built-in commands such
as mean in MATLAB and np.mean in Python. However, caution is needed when performing
the calculation. In MATLAB, mean computes along the first dimension (down the rows) by default. Thus, if we
have an N x 2 array whose rows are the samples, applying mean will give us the 1 x 2 mean vector.
Calling mean(X,2) instead averages across each row and returns an N x 1 vector, which is
a per-sample average and not the mean vector. Similarly, in Python,
when calling np.mean, we need to specify the axis: axis=0 averages over the samples.


% MATLAB code to compute a mean vector
X = randn(100,2);    % 100 samples (rows) of a 2D random vector
mX = mean(X,1);      % average over the rows: 1 x 2 mean vector


# Python code to compute a mean vector
import numpy as np
import scipy.stats as stats

# 100 samples (rows) of a zero-mean 2D Gaussian with identity covariance
X = stats.multivariate_normal.rvs([0,0],[[1,0],[0,1]],100)
mX = np.mean(X,axis=0)   # average over the rows: mean vector of length 2


5.6.3 Covariance matrix 


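Definition 5.20. The covariance matrix of a random vector $\boldsymbol{X} = [X_1, \ldots, X_N]^T$ is

$$\boldsymbol{\Sigma} \overset{\text{def}}{=} \text{Cov}(\boldsymbol{X}) = \begin{bmatrix}
\text{Var}[X_1] & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_N) \\
\text{Cov}(X_2, X_1) & \text{Var}[X_2] & \cdots & \text{Cov}(X_2, X_N) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(X_N, X_1) & \text{Cov}(X_N, X_2) & \cdots & \text{Var}[X_N]
\end{bmatrix}. \qquad (5.33)$$


A more compact way of writing the covariance matrix is

$$\boldsymbol{\Sigma} = \text{Cov}(\boldsymbol{X}) = \mathbb{E}[(\boldsymbol{X} - \boldsymbol{\mu})(\boldsymbol{X} - \boldsymbol{\mu})^T],$$

where $\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{X}]$ is the mean vector. The notation $\boldsymbol{a}\boldsymbol{b}^T$ means the outer product, defined
as

$$\boldsymbol{a}\boldsymbol{b}^T = \begin{bmatrix} a_1 \\ \vdots \\ a_N \end{bmatrix} \begin{bmatrix} b_1 & \cdots & b_N \end{bmatrix}
= \begin{bmatrix} a_1 b_1 & a_1 b_2 & \cdots & a_1 b_N \\ \vdots & \vdots & \ddots & \vdots \\ a_N b_1 & a_N b_2 & \cdots & a_N b_N \end{bmatrix}.$$

It is easy to show that $\text{Cov}(\boldsymbol{X}) = \text{Cov}(\boldsymbol{X})^T$, i.e., the covariance matrix is symmetric.


Theorem 5.13. If the coordinates $X_1, \ldots, X_N$ are independent, then the covariance
matrix $\text{Cov}(\boldsymbol{X}) = \boldsymbol{\Sigma}$ is a diagonal matrix:

$$\boldsymbol{\Sigma} = \text{Cov}(\boldsymbol{X}) = \begin{bmatrix}
\text{Var}[X_1] & 0 & \cdots & 0 \\
0 & \text{Var}[X_2] & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \text{Var}[X_N]
\end{bmatrix}.$$

Proof. If all $X_i$'s are independent, then $\text{Cov}(X_i, X_j) = 0$ for all $i \neq j$. Substituting this
into the definition of the covariance matrix, we obtain the result.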


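If we ignore the mean vector $\boldsymbol{\mu}$, we obtain the autocorrelation matrix $\boldsymbol{R}$.


Definition 5.21. Let $\boldsymbol{X} = [X_1, \ldots, X_N]^T$ be a random vector. The autocorrelation
matrix is

$$\boldsymbol{R} = \mathbb{E}[\boldsymbol{X}\boldsymbol{X}^T] = \begin{bmatrix}
\mathbb{E}[X_1 X_1] & \mathbb{E}[X_1 X_2] & \cdots & \mathbb{E}[X_1 X_N] \\
\mathbb{E}[X_2 X_1] & \mathbb{E}[X_2 X_2] & \cdots & \mathbb{E}[X_2 X_N] \\
\vdots & \vdots & \ddots & \vdots \\
\mathbb{E}[X_N X_1] & \mathbb{E}[X_N X_2] & \cdots & \mathbb{E}[X_N X_N]
\end{bmatrix}.$$


We state without proof that

$$\boldsymbol{\Sigma} = \boldsymbol{R} - \boldsymbol{\mu}\boldsymbol{\mu}^T,$$

which corresponds to the single-variable case where $\sigma^2 = \mathbb{E}[X^2] - \mu^2$.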
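The same identity holds exactly for sample estimates, provided the covariance is normalized by the number of samples. The sketch below uses hypothetical data and relies on `np.cov` with `bias=True`, which divides by $N$ rather than $N - 1$.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data: 1000 samples (rows) of a 2D vector with nonzero mean.
X = rng.normal(loc=[1.0, -2.0], scale=[1.0, 3.0], size=(1000, 2))

mu_hat = X.mean(axis=0)              # sample mean vector
R_hat = X.T @ X / X.shape[0]         # sample autocorrelation matrix
Sigma_from_R = R_hat - np.outer(mu_hat, mu_hat)

# Sample covariance with the same 1/N normalization.
Sigma_hat = np.cov(X, rowvar=False, bias=True)

print(np.max(np.abs(Sigma_from_R - Sigma_hat)))
```

With the default `bias=False`, the two matrices would differ by the factor $N/(N-1)$.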

On computers, computing the covariance matrix is done using built-in commands cov
in MATLAB and np.cov in Python. Like the mean vectors, when computing the covariance,
we need to specify the direction. For example, for an N x 2 data matrix X, the covariance
needs to be a 2 x 2 matrix. If we compute the covariance along the wrong direction, we will
obtain an N x N matrix, which is incorrect.


% MATLAB code to compute covariance matrix
X = randn(100,2);    % 100 samples (rows) of a 2D random vector
covX = cov(X);       % rows are observations: 2 x 2 covariance matrix


# Python code to compute covariance matrix
import numpy as np
import scipy.stats as stats

# 100 samples (rows) of a zero-mean 2D Gaussian with identity covariance
X = stats.multivariate_normal.rvs([0,0],[[1,0],[0,1]],100)
covX = np.cov(X,rowvar=False)   # rowvar=False: columns are the variables
print(covX)


5.6.4 Multidimensional Gaussian 


With the above tools in hand, we can now define a high-dimensional Gaussian. The PDF of 
a high-dimensional Gaussian is defined as follows. 


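Definition 5.22. A $d$-dimensional joint Gaussian has the PDF

$$f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^d\, |\boldsymbol{\Sigma}|}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right\},$$

where $d$ denotes the dimensionality of the vector $\boldsymbol{x}$.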
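The mean vector and the covariance matrix of a joint Gaussian are readily available from the
definition:

$$\mathbb{E}[\boldsymbol{X}] = \boldsymbol{\mu} \quad \text{and} \quad \text{Cov}(\boldsymbol{X}) = \boldsymbol{\Sigma}.$$

It is easy to show that if $\boldsymbol{X}$ is a scalar $X$, then $d = 1$, $\boldsymbol{\mu} = \mu$, and $\boldsymbol{\Sigma} = \sigma^2$. Substituting
these into the above definition returns us the familiar 1D Gaussian.

The $d$-dimensional Gaussian is a generalization of the 1D Gaussian(s). Suppose that $X_i$
and $X_j$ are independent for all $i \neq j$. Then $\mathbb{E}[X_i X_j] = \mathbb{E}[X_i]\,\mathbb{E}[X_j]$ and hence $\text{Cov}(X_i, X_j) =
0$. Consequently, the covariance matrix $\boldsymbol{\Sigma}$ is a diagonal matrix:

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_d^2 \end{bmatrix},$$

where $\sigma_i^2 = \text{Var}[X_i]$. When this occurs, the exponential term in the Gaussian PDF is

$$(\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) =
\begin{bmatrix} x_1 - \mu_1 \\ \vdots \\ x_d - \mu_d \end{bmatrix}^T
\begin{bmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_d^2 \end{bmatrix}^{-1}
\begin{bmatrix} x_1 - \mu_1 \\ \vdots \\ x_d - \mu_d \end{bmatrix}
= \sum_{i=1}^{d} \frac{(x_i - \mu_i)^2}{\sigma_i^2}.$$

Moreover, the determinant $|\boldsymbol{\Sigma}|$ is

$$|\boldsymbol{\Sigma}| = \left| \begin{bmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_d^2 \end{bmatrix} \right| = \prod_{i=1}^{d} \sigma_i^2.$$

Substituting these results into the joint Gaussian PDF, we obtain

$$f_{\boldsymbol{X}}(\boldsymbol{x}) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left\{ -\frac{(x_i - \mu_i)^2}{2\sigma_i^2} \right\},$$

which is a product of individual Gaussians.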
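The factorization can be confirmed numerically by evaluating both sides at a point. The sketch below uses hypothetical values for the mean, the variances, and the evaluation point, with SciPy's `multivariate_normal.pdf` for the joint density and `norm.pdf` for the 1D densities.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Hypothetical 3D example with a diagonal covariance matrix.
mu = np.array([1.0, -1.0, 0.5])
var = np.array([0.5, 2.0, 1.5])       # sigma_i^2 for each coordinate
x = np.array([0.3, 0.2, -1.0])        # point at which to evaluate the PDF

# Joint Gaussian PDF with Sigma = diag(var).
joint = multivariate_normal.pdf(x, mean=mu, cov=np.diag(var))

# Product of the d individual 1D Gaussian PDFs (scale is the standard deviation).
product = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(var)))

print(joint, product)
```

If an off-diagonal entry of the covariance is made nonzero, the two numbers no longer match, which reflects the loss of independence.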

The Gaussian has different offsets and orientations for different choices of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.
Figure 5.15 shows a few examples. Note that for $\boldsymbol{\Sigma}$ to be valid, $\boldsymbol{\Sigma}$ has to be "symmetric
positive semi-definite", the meaning of which will be explained shortly.

Generating random numbers from a multidimensional Gaussian can be done by calling
built-in commands. In MATLAB, we use mvnrnd. In Python, we have a similar command.


% MATLAB code to generate random numbers from multivariate Gaussian
mu = [0 0];
Sigma = [.25 .3; .3 1];
X = mvnrnd(mu,Sigma,100);    % 100 samples, one per row


# Python code to generate random numbers from multivariate Gaussian
import numpy as np
import scipy.stats as stats

# 100 samples, one per row
X = stats.multivariate_normal.rvs([0,0],[[0.25,0.3],[0.3,1.0]],100)
Figure 5.15: Visualization of 2D Gaussians with different means and covariances.


To display the data points and overlay with the contour, we can use MATLAB commands
such as contour. The resulting plot looks like the one shown in Figure 5.16. In
Python the corresponding command is plt.contour. To set up the plotting environment
we use the command np.meshgrid. The grid points are used to evaluate the PDF values,
thus giving us the contour.


% MATLAB code: Overlay random numbers with the Gaussian contour.
X = mvnrnd([0 0],[.25 .3; .3 1],1000);
x1 = -2.5:.01:2.5;
x2 = -3.5:.01:3.5;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],[0 0],[.25 .3; .3 1]);   % PDF at every grid point
F = reshape(F,length(x2),length(x1));
figure(1);
scatter(X(:,1),X(:,2),'rx','LineWidth',1.5); hold on;
contour(x1,x2,F,[.001 .01 .05:.1:.95 .99 .999],'LineWidth',2);


# Python code: Overlay random numbers with the Gaussian contour.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

X = stats.multivariate_normal.rvs([0,0],[[0.25,0.3],[0.3,1.0]],1000)
x1 = np.arange(-2.5, 2.5, 0.01)
x2 = np.arange(-3.5, 3.5, 0.01)
X1, X2 = np.meshgrid(x1,x2)
# Stack the grid coordinates so that Xpos[i,j,:] is one 2D point
Xpos = np.empty(X1.shape + (2,))
Xpos[:,:,0] = X1
Xpos[:,:,1] = X2
F = stats.multivariate_normal.pdf(Xpos,[0,0],[[0.25,0.3],[0.3,1.0]])
plt.scatter(X[:,0],X[:,1])
plt.contour(x1,x2,F)
plt.show()
Figure 5.16: 1000 random numbers drawn from a 2D Gaussian, overlaid with the contour plot.


5.7 Transformation of Multidimensional Gaussians 


As we have seen in Figure 5.15, the shape and orientation of a multidimensional Gaussian
are determined by the mean vector $\boldsymbol{\mu}$ and the covariance matrix $\boldsymbol{\Sigma}$. This means that if we
can somehow transform the mean vector and the covariance matrix, we will get another
Gaussian. A few practical questions are:

• How do we shift and rotate a Gaussian random variable?

• If we have an arbitrary Gaussian, how do we go back to a zero-mean unit-variance
Gaussian?

• How do we generate random vectors according to a predefined Gaussian?

These questions come up frequently in data analysis. Answering the first two questions will
help us transform Gaussians back and forth, while answering the last question will help us
with generating random samples.


5.7.1 Linear transformation of mean and covariance 


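Suppose we have an arbitrary (not necessarily a Gaussian) random vector $\boldsymbol{X} = [X_1, \ldots, X_N]^T$
with mean $\boldsymbol{\mu}_X$ and covariance $\boldsymbol{\Sigma}_X$. Entries of $\boldsymbol{X}$ are not necessarily independent. Let
$\boldsymbol{A} \in \mathbb{R}^{N \times N}$ be a transformation, and let $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}$. That is,

$$\boldsymbol{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix}
= \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_N \end{bmatrix} = \boldsymbol{A}\boldsymbol{X}.$$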


Then we can show the following result. 




CHAPTER 5. JOINT DISTRIBUTIONS 


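Theorem 5.14. The mean vector and covariance matrix of $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}$ are

$$\boldsymbol{\mu}_Y = \boldsymbol{A}\boldsymbol{\mu}_X, \qquad \boldsymbol{\Sigma}_Y = \boldsymbol{A}\boldsymbol{\Sigma}_X \boldsymbol{A}^T.$$


Proof. We first show the mean. Consider the $n$th element of $\boldsymbol{Y}$:

$$\mathbb{E}[Y_n] = \mathbb{E}\left[ \sum_{k=1}^{N} a_{nk} X_k \right] = \sum_{k=1}^{N} a_{nk}\, \mathbb{E}[X_k].$$

Therefore,

$$\boldsymbol{\mu}_Y = \begin{bmatrix} \mathbb{E}[Y_1] \\ \mathbb{E}[Y_2] \\ \vdots \\ \mathbb{E}[Y_N] \end{bmatrix}
= \begin{bmatrix} \sum_{k=1}^{N} a_{1k}\, \mathbb{E}[X_k] \\ \sum_{k=1}^{N} a_{2k}\, \mathbb{E}[X_k] \\ \vdots \\ \sum_{k=1}^{N} a_{Nk}\, \mathbb{E}[X_k] \end{bmatrix}
= \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{bmatrix}
\begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_N] \end{bmatrix} = \boldsymbol{A}\boldsymbol{\mu}_X.$$

The covariance matrix follows from the fact that

$$\begin{aligned}
\boldsymbol{\Sigma}_Y &= \mathbb{E}[(\boldsymbol{Y} - \boldsymbol{\mu}_Y)(\boldsymbol{Y} - \boldsymbol{\mu}_Y)^T] \\
&= \mathbb{E}[(\boldsymbol{A}\boldsymbol{X} - \boldsymbol{A}\boldsymbol{\mu}_X)(\boldsymbol{A}\boldsymbol{X} - \boldsymbol{A}\boldsymbol{\mu}_X)^T] \\
&= \mathbb{E}[\boldsymbol{A}(\boldsymbol{X} - \boldsymbol{\mu}_X)(\boldsymbol{X} - \boldsymbol{\mu}_X)^T \boldsymbol{A}^T] \\
&= \boldsymbol{A}\, \mathbb{E}[(\boldsymbol{X} - \boldsymbol{\mu}_X)(\boldsymbol{X} - \boldsymbol{\mu}_X)^T]\, \boldsymbol{A}^T \\
&= \boldsymbol{A}\boldsymbol{\Sigma}_X \boldsymbol{A}^T.
\end{aligned}$$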
Gl 


What if we shift the random vector by defining Y = X + b? We state the following 
result without proof (try proving it as an exercise). 


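Theorem 5.15. The mean vector and covariance matrix of $\boldsymbol{Y} = \boldsymbol{X} + \boldsymbol{b}$ are

$$\boldsymbol{\mu}_Y = \boldsymbol{\mu}_X + \boldsymbol{b}, \qquad \boldsymbol{\Sigma}_Y = \boldsymbol{\Sigma}_X.$$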
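Theorems 5.14 and 5.15 hold for sample means and sample covariances as well, since both are built from the same linear operations. The sketch below combines them into $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{b}$; the data, the matrix `A`, and the vector `b` are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical correlated 2D data: 5000 samples, one per row.
X = rng.standard_normal((5000, 2)) @ np.array([[1.0, 0.0], [0.5, 2.0]])

A = np.array([[2.0, 1.0], [0.0, 3.0]])   # hypothetical transformation
b = np.array([1.0, -4.0])                # hypothetical shift

# Samples are stored as rows, so y = A x + b becomes Y = X A^T + b.
Y = X @ A.T + b

mu_X, Sigma_X = X.mean(axis=0), np.cov(X, rowvar=False)
mu_Y, Sigma_Y = Y.mean(axis=0), np.cov(Y, rowvar=False)

print(np.allclose(mu_Y, A @ mu_X + b))          # mean: A mu_X + b
print(np.allclose(Sigma_Y, A @ Sigma_X @ A.T))  # covariance: A Sigma_X A^T
```

The shift `b` changes the mean but drops out of the covariance, in line with Theorem 5.15.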


For a Gaussian random vector, these transformations either shift the Gaussian or
rotate and scale the Gaussian, as shown in Figure 5.17:

• If we add $\boldsymbol{b}$ to $\boldsymbol{X}$, the resulting operation is a translation.

• If we multiply $\boldsymbol{X}$ by $\boldsymbol{A}$, then the resulting operation is a rotation and scaling.


Figure 5.17: Transforming a Gaussian. [Left] Translation by a vector $\boldsymbol{b}$. [Right] Rotation and scaling by
a matrix $\boldsymbol{A}$.


How to rotate, scale, and translate a Gaussian random variable

• We rotate and scale a Gaussian by $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}$.
• We translate a Gaussian by $\boldsymbol{Y} = \boldsymbol{X} + \boldsymbol{b}$.

5.7.2 Eigenvalues and eigenvectors 


As our next step, we need to understand eigendecomposition. You can easily find relevant 
background in any undergraduate linear algebra textbook. Here we provide a summary for 
completeness. 


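When applying a matrix $\boldsymbol{A}$ to a vector $\boldsymbol{x}$, a typical engineering question is: what $\boldsymbol{x}$
would be invariant to $\boldsymbol{A}$? Or in other words, for what $\boldsymbol{x}$ can we make sure that $\boldsymbol{A}\boldsymbol{x} = \lambda\boldsymbol{x}$,
for some scalar $\lambda$? If we can find such a vector $\boldsymbol{x}$, we say that $\boldsymbol{x}$ is an eigenvector of $\boldsymbol{A}$.
Eigenvectors are useful for seeking principal components of datasets or finding efficient signal
representations. They are defined as follows:


Definition 5.23. Given a square matrix $\boldsymbol{A} \in \mathbb{R}^{N \times N}$, the vector $\boldsymbol{u} \in \mathbb{R}^N$ (with $\boldsymbol{u} \neq \boldsymbol{0}$)
is called the eigenvector of $\boldsymbol{A}$ if

$$\boldsymbol{A}\boldsymbol{u} = \lambda\boldsymbol{u}, \qquad (5.38)$$

for some $\lambda \in \mathbb{R}$. The scalar $\lambda$ is called the eigenvalue associated with $\boldsymbol{u}$.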


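An $N \times N$ matrix has $N$ eigenvalues when they are counted with multiplicity (some of which may
be complex). When the matrix also has $N$ linearly independent eigenvectors, the above equation can
be generalized to

$$\boldsymbol{A}\boldsymbol{u}_i = \lambda_i \boldsymbol{u}_i,$$

for $i = 1, \ldots, N$, or more compactly as $\boldsymbol{A}\boldsymbol{U} = \boldsymbol{U}\boldsymbol{\Lambda}$, where the columns of $\boldsymbol{U}$ are the
eigenvectors and $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues. The eigenvalues $\lambda_1, \ldots, \lambda_N$ are not
necessarily distinct. There are matrices with identical eigenvalues, the identity matrix being
a trivial example. On the other hand, not all square matrices have a full set of $N$ linearly
independent eigenvectors. For example, the matrix $\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$ has the repeated eigenvalue 0
but only one independent eigenvector. Matrices that admit the factorization $\boldsymbol{A}\boldsymbol{U} = \boldsymbol{U}\boldsymbol{\Lambda}$ with an
invertible $\boldsymbol{U}$ are called diagonalizable.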
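There are a number of equivalent conditions for $\lambda$ to be an eigenvalue:

• There exists $\boldsymbol{u} \neq \boldsymbol{0}$ such that $\boldsymbol{A}\boldsymbol{u} = \lambda\boldsymbol{u}$;
• There exists $\boldsymbol{u} \neq \boldsymbol{0}$ such that $(\boldsymbol{A} - \lambda\boldsymbol{I})\boldsymbol{u} = \boldsymbol{0}$;
• $(\boldsymbol{A} - \lambda\boldsymbol{I})$ is not invertible;
• $\det(\boldsymbol{A} - \lambda\boldsymbol{I}) = 0$.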


We are mostly interested in symmetric matrices. If A is symmetric, then all the eigen- 
values are real, and the following result holds. 


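Theorem 5.16. If $\boldsymbol{A}$ is symmetric, all the eigenvalues are real, and there exists $\boldsymbol{U}$
such that $\boldsymbol{U}^T\boldsymbol{U} = \boldsymbol{I}$ and $\boldsymbol{A} = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^T$. Then

$$\boldsymbol{A} = \begin{bmatrix} | & | & & | \\ \boldsymbol{u}_1 & \boldsymbol{u}_2 & \cdots & \boldsymbol{u}_N \\ | & | & & | \end{bmatrix}
\begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_N \end{bmatrix}
\begin{bmatrix} \text{---}\ \boldsymbol{u}_1^T\ \text{---} \\ \vdots \\ \text{---}\ \boldsymbol{u}_N^T\ \text{---} \end{bmatrix}.$$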

We call such a decomposition the eigendecomposition. In MATLAB, we can compute the
eigenvalues of a matrix by using the eig command. In Python, the corresponding command
is np.linalg.eig. Note that in our demonstration below we symmetrize the matrix. This
step is needed, for otherwise the eigenvalues will generally contain complex numbers.


% MATLAB Code to perform eigendecomposition
A = randn(100,100);
A = (A + A')/2;     % symmetrize because A is not symmetric
[U,S] = eig(A);     % eigendecomposition: S is a diagonal matrix
s = diag(S);        % extract the eigenvalues as a vector


# Python Code to perform eigendecomposition
import numpy as np

A = np.random.randn(100, 100)
A = (A + np.transpose(A))/2    # symmetrize because A is not symmetric
s, U = np.linalg.eig(A)        # s: 1-D array of eigenvalues; columns of U: eigenvectors
S = np.diag(s)                 # diagonal matrix of eigenvalues

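The condition $\boldsymbol{U}^T\boldsymbol{U} = \boldsymbol{I}$ means that the columns of $\boldsymbol{U}$ are orthonormal. Equivalently,
$\boldsymbol{u}_i^T\boldsymbol{u}_j = 1$ if $i = j$ and $\boldsymbol{u}_i^T\boldsymbol{u}_j = 0$ if $i \neq j$. Since $\{\boldsymbol{u}_i\}_{i=1}^{N}$ is orthonormal, it can serve as a
basis of any vector in $\mathbb{R}^N$:

$$\boldsymbol{x} = \sum_{j=1}^{N} \alpha_j \boldsymbol{u}_j,$$

where $\alpha_j = \boldsymbol{u}_j^T \boldsymbol{x}$ is called the basis coefficient. Basis vectors are useful in that they can
provide alternative representations of a vector.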
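These properties can be checked on a small random symmetric matrix. The sketch below uses `np.linalg.eigh`, NumPy's eigendecomposition routine for symmetric matrices (a substitution for the `eig` command named in the text, chosen here because it returns real eigenvalues and orthonormal eigenvectors); the 4 x 4 size and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                 # symmetrize

lam, U = np.linalg.eigh(A)        # real eigenvalues, orthonormal eigenvectors (columns)

# Orthonormality: U^T U = I, and reconstruction: A = U Lambda U^T.
print(np.allclose(U.T @ U, np.eye(4)))
print(np.allclose(U @ np.diag(lam) @ U.T, A))

# Basis expansion of an arbitrary vector x.
x = rng.standard_normal(4)
alpha = U.T @ x                   # alpha_j = u_j^T x
x_rebuilt = U @ alpha             # sum_j alpha_j u_j
print(np.allclose(x_rebuilt, x))
```

The last check shows that the coefficients `alpha` carry the same information as `x`, expressed in the eigenvector basis.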
Figure 5.18: The center and the radii of the ellipse are determined by $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.


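The geometry of the joint Gaussian is determined by its eigenvalues and eigenvectors.
Consider the eigendecomposition of $\boldsymbol{\Sigma}$:

$$\boldsymbol{\Sigma} = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^T
= \begin{bmatrix} | & | & & | \\ \boldsymbol{u}_1 & \boldsymbol{u}_2 & \cdots & \boldsymbol{u}_d \\ | & | & & | \end{bmatrix}
\begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{bmatrix}
\begin{bmatrix} \text{---}\ \boldsymbol{u}_1^T\ \text{---} \\ \text{---}\ \boldsymbol{u}_2^T\ \text{---} \\ \vdots \\ \text{---}\ \boldsymbol{u}_d^T\ \text{---} \end{bmatrix},$$

for some unitary matrix $\boldsymbol{U}$ and diagonal matrix $\boldsymbol{\Lambda}$. The columns of $\boldsymbol{U}$ are called the
eigenvectors, and the entries of $\boldsymbol{\Lambda}$ are called the eigenvalues. Since $\boldsymbol{\Sigma}$ is symmetric, all $\lambda_i$'s are
real. In addition, since $\boldsymbol{\Sigma}$ is positive semi-definite, all $\lambda_i$'s are non-negative. Accordingly, the
volume defined by the multidimensional Gaussian is always a convex object, e.g., an ellipse
in 2D or an ellipsoid in 3D.

The orientation of the axes is defined by the column vectors $\boldsymbol{u}_i$. In the case of $d = 2$,
the major axis is defined by $\boldsymbol{u}_1$ and the minor axis is defined by $\boldsymbol{u}_2$. The corresponding radii
of each axis are specified by the eigenvalues $\lambda_1$ and $\lambda_2$ (the radius along $\boldsymbol{u}_i$ is proportional
to $\sqrt{\lambda_i}$). Figure 5.18 provides an illustration.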


5.7.3 Covariance matrices are always positive semi-definite 


The following subsection about positive semi-definite matrices can be skipped if it is your first time reading the book.


Now that we understand eigendecomposition, what can we do with it? Here is one practical problem. Given a matrix $\Sigma$, how do you know whether this $\Sigma$ is valid? For example, if we give you a singular matrix, then $\Sigma^{-1}$ may not exist. Checking the validity of $\Sigma$ requires the concept of positive semi-definiteness.

Given a square matrix $A \in \mathbb{R}^{N \times N}$, it is important to check the positive semi-definiteness of $A$. There are two practical scenarios where we need positive semi-definiteness. (1) If we are estimating the covariance matrix $\Sigma$ from a dataset, we need to ensure that $\Sigma = E[(X - \mu)(X - \mu)^T]$ is positive semi-definite because all covariance matrices are positive semi-definite. Otherwise, the matrix we estimate is not a legitimate covariance matrix. (2) If we solve an optimization problem involving a function $f(x) = x^TAx$, then a positive semi-definite $A$ guarantees that the problem is convex. Convex problems ensure that a local minimum is also global, and they can be solved efficiently using known algorithms.


Definition 5.24 (Positive Semi-Definite). A matrix $A \in \mathbb{R}^{N \times N}$ is positive semi-definite if

$$x^TAx \geq 0 \qquad (5.40)$$

for any $x \in \mathbb{R}^N$. $A$ is positive definite if $x^TAx > 0$ for any nonzero $x \in \mathbb{R}^N$.


Using eigendecomposition, it is not difficult to show that positive semi-definiteness is equiv- 
alent to having non-negative eigenvalues. 


Theorem 5.17. A matrix $A \in \mathbb{R}^{N \times N}$ is positive semi-definite if and only if

$$\lambda_i(A) \geq 0 \qquad (5.41)$$

for all $i = 1, \ldots, N$, where $\lambda_i(A)$ denotes the $i$th eigenvalue of $A$.


Proof. By the definitions of eigenvalue and eigenvector, we have that

$$Au_i = \lambda_iu_i,$$

where $\lambda_i$ is the eigenvalue and $u_i$ is the corresponding eigenvector. If $A$ is positive semi-definite, then $u_i^TAu_i \geq 0$ since $u_i$ is a particular vector in $\mathbb{R}^N$. So we have

$$0 \leq u_i^TAu_i = \lambda_i\|u_i\|^2,$$

and hence $\lambda_i \geq 0$. Conversely, if $\lambda_i \geq 0$ for all $i$, then since $A = \sum_{i=1}^{N} \lambda_iu_iu_i^T$ we can conclude that

$$x^TAx = x^T\left(\sum_{i=1}^{N} \lambda_iu_iu_i^T\right)x = \sum_{i=1}^{N} \lambda_i(u_i^Tx)^2 \geq 0.$$
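The eigenvalue criterion suggests a simple numerical test. The sketch below is one possible way to implement it, not a definitive one: the function name is_psd and the tolerance tol are assumptions introduced here, and the small tolerance is only there to absorb floating-point round-off.

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """Return True if the symmetric matrix A has no eigenvalue below -tol."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        return False                      # the check below assumes symmetry
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

print(is_psd([[3, -0.5], [-0.5, 1]]))   # both eigenvalues are positive
print(is_psd([[1, 2], [2, 1]]))         # eigenvalues are 3 and -1
```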


The following corollary shows that if $A \in \mathbb{R}^{N \times N}$ is positive definite, it must be invertible. Being invertible also means that the columns of $A$ are linearly independent.


Corollary 5.2. If a matrix $A \in \mathbb{R}^{N \times N}$ is positive definite (not merely positive semi-definite), then $A$ must be invertible, i.e., there exists $A^{-1} \in \mathbb{R}^{N \times N}$ such that

$$A^{-1}A = AA^{-1} = I. \qquad (5.42)$$


The next theorem tells us that the covariance matrix is always positive semi-definite.

Theorem 5.18. The covariance matrix $\mathrm{Cov}(X) = \Sigma$ is symmetric positive semi-definite, i.e.,

$$\Sigma^T = \Sigma, \quad \text{and} \quad v^T\Sigma v \geq 0, \quad \forall v \in \mathbb{R}^d.$$


Proof. Symmetry follows immediately from the definition, because $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$. The positive semi-definiteness comes from the fact that

$$\begin{aligned} v^T\Sigma v &= v^TE[(X - \mu)(X - \mu)^T]v \\ &= E[v^T(X - \mu)(X - \mu)^Tv] \\ &= E[b^Tb] = E[\|b\|^2] \geq 0, \end{aligned}$$

where $b = (X - \mu)^Tv$.
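The identity used in the proof also holds for a sample covariance matrix, which can be checked numerically. In this sketch the data, the dimension, and the seed are arbitrary assumptions. The expectation is replaced by a sample average with a $1/N$ normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 3))  # rows are samples
mu = X.mean(axis=0)
Sigma_hat = (X - mu).T @ (X - mu) / X.shape[0]   # 1/N sample covariance

v = rng.standard_normal(3)
b = (X - mu) @ v                     # b_n = (x_n - mu)^T v, one scalar per sample
quad = v @ Sigma_hat @ v             # v^T Sigma v
print(quad, np.mean(b**2))           # the two numbers agree and are non-negative
```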


End of the discussion. 


5.7.4 Gaussian whitening 


Besides checking positive semi-definiteness, another typical problem we encounter is how to 
generate random samples according to some Gaussian distributions. 


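From Gaussian(0, I) to Gaussian($\mu$, $\Sigma$). If we are given a zero-mean unit-variance Gaussian $X \sim \text{Gaussian}(0, I)$, how do we generate $Y \sim \text{Gaussian}(\mu, \Sigma)$ from $X$? The idea is to define a transformation

$$Y = \Sigma^{1/2}X + \mu,$$

where $\Sigma^{1/2} = U\Lambda^{1/2}U^T$. Then the mean of $Y$ is

$$\begin{aligned} E[Y] &= E[\Sigma^{1/2}X + \mu] \\ &= \Sigma^{1/2}E[X] + \mu = \Sigma^{1/2}0 + \mu = \mu, \end{aligned}$$

and the covariance matrix is

$$\begin{aligned} E[(Y - \mu)(Y - \mu)^T] &= E[(\Sigma^{1/2}X + \mu - \mu)(\Sigma^{1/2}X + \mu - \mu)^T] \\ &= E[(\Sigma^{1/2}X)(\Sigma^{1/2}X)^T] = \Sigma^{1/2}E[XX^T]\Sigma^{1/2} \\ &= \Sigma^{1/2}I\Sigma^{1/2} = \Sigma. \end{aligned}$$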


Theorem 5.19. Let $X \sim \text{Gaussian}(0, I)$. Consider a mean vector $\mu$ and a covariance matrix $\Sigma$ with eigendecomposition $\Sigma = U\Lambda U^T$. If

$$Y = \Sigma^{1/2}X + \mu, \qquad (5.43)$$

where $\Sigma^{1/2} = U\Lambda^{1/2}U^T$, then $Y \sim \text{Gaussian}(\mu, \Sigma)$.

Therefore, the two steps for doing this Gaussian whitening are:

• Step 1: Generate samples $\{x_1, \ldots, x_N\}$ that are distributed according to Gaussian(0, I).
• Step 2: Define $y_n$, where
$$y_n = \Sigma^{1/2}x_n + \mu.$$

These two steps are portrayed in Figure 5.19.


Figure 5.19: Generating an arbitrary Gaussian from Gaussian(0, I).


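Example 5.26. Consider a set of $N = 1000$ i.i.d. Gaussian(0, I) data points as shown in Figure 5.20, for example,

$$x_1 = \begin{bmatrix} 0.5377 \\ 1.8339 \end{bmatrix}, \quad x_2 = \begin{bmatrix} -2.2588 \\ 0.8622 \end{bmatrix}, \quad \ldots, \quad x_{1000} = \begin{bmatrix} 0.3188 \\ -1.3077 \end{bmatrix}.$$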

[Figure 5.20, panels (a) Before and (b) After: scatter plots of the data points, with both axes ranging from -5 to 5.]


Figure 5.20: Generating arbitrary Gaussian random variables from Gaussian(0, I). 


Transform these data points so that the new distribution is a Gaussian with

$$\mu = \begin{bmatrix} 1 \\ -2 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} 3 & -0.5 \\ -0.5 & 1 \end{bmatrix}.$$

Solution. To perform the transformation, we first perform eigendecomposition of $\Sigma = U\Lambda U^T$. Then $\Sigma^{1/2} = U\Lambda^{1/2}U^T$. For our problem, we compute

$$\Sigma^{1/2} = \begin{bmatrix} 1.7222 & -0.1848 \\ -0.1848 & 0.9828 \end{bmatrix}.$$

Multiplying this matrix to yield $y_n = \Sigma^{1/2}x_n + \mu$, we obtain

$$y_1 = \begin{bmatrix} 1.5870 \\ -0.2971 \end{bmatrix}, \quad y_2 = \begin{bmatrix} -3.0495 \\ -0.7351 \end{bmatrix}, \quad \ldots, \quad y_{1000} = \begin{bmatrix} 1.7907 \\ -3.3441 \end{bmatrix}.$$

In MATLAB, the above whitening procedure can be realized using the following commands.


% MATLAB code to perform the whitening
x = mvnrnd([0,0],[1 0; 0 1],1000);
Sigma = [3 -0.5; -0.5 1];
mu = [1; -2];
y = Sigma^(0.5)*x' + mu;   % matrix square root, then shift by the mean


The Python implementation is similar, although one needs to be careful with the more complicated syntax. For example, Sigma^(0.5) in MATLAB computes the eigen-based matrix power automatically, whereas in Python we need to call the SciPy function fractional_matrix_power. In MATLAB, adding a vector to a matrix is broadcast automatically. In the Python code, we call repmat explicitly to control the shape of the mean vectors.


# Python code to perform the whitening
import numpy as np
import numpy.matlib                      # needed for np.matlib.repmat
from scipy.linalg import fractional_matrix_power

x = np.random.multivariate_normal([0,0],[[1,0],[0,1]],1000)
mu = np.array([1,-2])
Sigma = np.array([[3, -0.5],[-0.5, 1]])
Sigma2 = fractional_matrix_power(Sigma,0.5)                  # Sigma^(1/2)
y = np.dot(Sigma2, x.T) + np.matlib.repmat(mu,1000,1).T      # 2 x 1000 array
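The matrix square root can also be formed directly from the eigendecomposition, following $\Sigma^{1/2} = U\Lambda^{1/2}U^T$. The sketch below is an alternative to the listing above rather than a replacement. It assumes samples are stored as rows, uses NumPy broadcasting to add the mean, and fixes a seed and a sample size of 10,000 purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[3.0, -0.5], [-0.5, 1.0]])

lam, U = np.linalg.eigh(Sigma)                 # Sigma = U diag(lam) U^T
Sigma_half = U @ np.diag(np.sqrt(lam)) @ U.T   # Sigma^{1/2} = U Lambda^{1/2} U^T

x = rng.standard_normal((10000, 2))            # rows ~ Gaussian(0, I)
y = x @ Sigma_half.T + mu                      # broadcasting adds mu to every row

print(np.round(Sigma_half, 4))
print(y.mean(axis=0))                          # close to mu
print(np.cov(y, rowvar=False))                 # close to Sigma
```

Checking that Sigma_half @ Sigma_half reproduces Sigma is a quick way to confirm that the square root was formed correctly.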


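From Gaussian($\mu$, $\Sigma$) to Gaussian(0, I). The reverse direction can be done as follows. Supposing that we have $Y \sim \text{Gaussian}(\mu, \Sigma)$, we define

$$X = \Sigma^{-1/2}(Y - \mu). \qquad (5.44)$$

Then the mean is

$$E[X] = \Sigma^{-1/2}(E[Y] - \mu) = 0.$$

The covariance is

$$\begin{aligned} \mathrm{Cov}(X) &= E[(X - \mu_X)(X - \mu_X)^T] \\ &= E[XX^T] \\ &= E\left[\Sigma^{-1/2}(Y - \mu)(Y - \mu)^T\Sigma^{-1/2}\right] \\ &= \Sigma^{-1/2}E\left[(Y - \mu)(Y - \mu)^T\right]\Sigma^{-1/2} \\ &= \Sigma^{-1/2}\Sigma\Sigma^{-1/2} = I. \end{aligned}$$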


The following theorem summarizes this result. 


Theorem 5.20. Let $Y$ be a Gaussian $Y \sim \text{Gaussian}(\mu, \Sigma)$. If

$$X = \Sigma^{-1/2}(Y - \mu), \qquad (5.45)$$

then $X \sim \text{Gaussian}(0, I)$.


Thus the two steps of doing this reversed Gaussian whitening are: 


• Step 1: Assuming that $y_1, \ldots, y_N$ are distributed as Gaussian($\mu$, $\Sigma$), estimate $\mu$ and $\Sigma$.

• Step 2: Define $x_n$, where
$$x_n = \Sigma^{-1/2}(y_n - \mu). \qquad (5.46)$$


These two steps are shown pictorially in Figure 5.21. 


Figure 5.21: Converting an arbitrary Gaussian back to Gaussian(0, I). 


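In practice, if we are given $\{y_n\}_{n=1}^{N}$, we need to estimate $\mu$ and $\Sigma$. The estimations are quite straightforward:

$$\widehat{\mu} = \frac{1}{N}\sum_{n=1}^{N} y_n, \qquad \widehat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (y_n - \widehat{\mu})(y_n - \widehat{\mu})^T.$$

On computers, these can be obtained using the commands mean and cov (by default cov normalizes by $N - 1$ instead of $N$, which makes little difference when $N$ is large). Once we have calculated $\widehat{\mu}$ and $\widehat{\Sigma}$, we can define $x_n$ as

$$x_n = \widehat{\Sigma}^{-1/2}(y_n - \widehat{\mu}).$$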


On computers, the codes for the whitening procedure that uses the estimated mean 
and covariance are shown below. 


% MATLAB code to perform whitening
y = mvnrnd([1; -2],[3 -0.5; -0.5 1],100);
mY = mean(y);                  % estimated mean
covY = cov(y);                 % estimated covariance
x = covY^(-0.5)*(y-mY)';       % whitened samples


# Python code to perform whitening
import numpy as np
import numpy.matlib                      # needed for np.matlib.repmat
from scipy.linalg import fractional_matrix_power

y = np.random.multivariate_normal([1,-2],[[3,-0.5],[-0.5,1]],100)
mY = np.mean(y,axis=0)                              # estimated mean
covY = np.cov(y,rowvar=False)                       # estimated covariance
covY2 = fractional_matrix_power(covY,-0.5)          # covY^(-1/2)
x = np.dot(covY2, (y-np.matlib.repmat(mY,100,1)).T) # whitened samples, 2 x 100
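A useful sanity check on the whitening step is that the whitened samples have zero sample mean and identity sample covariance. The sketch below assumes samples stored as rows and forms $\widehat{\Sigma}^{-1/2}$ from the eigendecomposition. The seed and the sample size of 5,000 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.multivariate_normal([1, -2], [[3, -0.5], [-0.5, 1]], size=5000)

mY = y.mean(axis=0)
covY = np.cov(y, rowvar=False)

lam, U = np.linalg.eigh(covY)
W = U @ np.diag(lam ** -0.5) @ U.T    # covY^{-1/2}
x = (y - mY) @ W.T                    # whitened samples, one per row

print(x.mean(axis=0))                 # zero up to floating-point error
print(np.cov(x, rowvar=False))        # identity up to floating-point error
```

Because the same estimated mean and covariance are used for whitening and for checking, the match is exact up to round-off. With the true $\mu$ and $\Sigma$ it would hold only approximately.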


5.8 Principal-Component Analysis 


We have studied the covariance matrix & in some depth. It has many other uses besides 
transforming Gaussian random variables, and in this section we present one of them, called 
the principal-component analysis (PCA). PCA is a widely used tool for dimension reduc- 
tion. Instead of using N features to describe a data point, PCA allows us to use the leading 
p principal components to describe the same data point. In many problems in machine 
learning, this makes the learning task easier and the inference task more efficient. 


5.8.1 The main idea: Eigendecomposition 


PCA can be summarized in one sentence: it is just the eigendecomposition of the covariance matrix. However, before we discuss the computational procedure, we will explain why we would want to perform the eigendecomposition of the covariance matrix.

Consider a set of data points $\{x^{(1)}, \ldots, x^{(N)}\}$, where each $x^{(n)} \in \mathbb{R}^d$ is a $d$-dimensional vector. The dimension $d$ is often high. For example, if we have an image of size $1024 \times 1024 \times 3$, then $d = 3{,}145{,}728$, which is not a huge number but is enough to make you feel dizzy. The goal of PCA is to find a low-dimensional representation in $\mathbb{R}^p$ where $p \ll d$. If we can find this low-dimensional representation, we can represent the $d$-dimensional input using only $p$ coefficients. Since $p \ll d$, we can “compress” the data by using a compact representation. In modern data science, such a dimension reduction scheme is useful for handling large-scale datasets.

Mathematically, we define a set of basis vectors $v_1, \ldots, v_p$, where each $v_i \in \mathbb{R}^d$. Our goal is to approximate an input data point $x^{(n)} \in \mathbb{R}^d$ by these basis vectors:

$$x^{(n)} \approx \sum_{i=1}^{p} \alpha_iv_i,$$

where $\{\alpha_i\}_{i=1}^{p}$ are called the representation coefficients. The representation described by this equation is a linear representation. Linear representation is extremely common in practice. For example, a data point $x^{(n)} = [7, 1, 4]^T$ can be represented as

$$\underbrace{\begin{bmatrix} 7 \\ 1 \\ 4 \end{bmatrix}}_{x^{(n)}} = \underbrace{3}_{\alpha_1}\underbrace{\begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}}_{v_1} + \underbrace{4}_{\alpha_2}\underbrace{\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}}_{v_2}.$$

Therefore, the 3-dimensional input $x^{(n)}$ can now be represented by two coefficients $\alpha_1 = 3$ and $\alpha_2 = 4$. This is called dimensionality reduction.
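When the basis vectors are known, the coefficients can be recovered numerically. The sketch below uses the data point $[7, 1, 4]^T$ with $v_1 = [1, -1, 0]^T$ and $v_2 = [1, 1, 1]^T$ from the example. Solving by least squares is a choice made here because the basis matrix is not square.

```python
import numpy as np

x = np.array([7.0, 1.0, 4.0])
V = np.column_stack(([1.0, -1.0, 0.0],    # v1
                     [1.0,  1.0, 1.0]))   # v2

# Least-squares solve of V @ alpha = x (V is 3x2, so it is not square)
alpha, residual, rank, _ = np.linalg.lstsq(V, x, rcond=None)
print(alpha)        # coefficients (alpha_1, alpha_2)
print(V @ alpha)    # reconstruction of x from two coefficients
```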

Pictorially, if we have already determined the basis vectors, we can compute the coefficients for every data point in the dataset. However, not all basis vectors are good. As illustrated in Figure 5.22, an elongated dataset benefits the most when the basis vectors are oriented according to the data geometry. If we can find such basis vectors, then each data point will have a large coefficient and a small coefficient, corresponding to the major and the minor axes. Dimensionality reduction can thus be achieved by, for example, only keeping the larger coefficients.


Figure 5.22: PCA aims at finding a low-dimensional representation of a high-dimensional dataset. In this figure, the 2D data points can be well represented by the 1D space spanned by $v_1$.


The challenge here is that, given the dataset $\{x^{(1)}, \ldots, x^{(N)}\}$, we need to determine both the basis vectors $\{v_i\}_{i=1}^{p}$ and the coefficients $\{\alpha_i\}_{i=1}^{p}$. Fortunately, this can be formulated as an eigendecomposition problem.

To see how this problem can be thus formulated, we consider the simplest case as illustrated in Figure 5.22, where we want to find the leading principal component. That is, we find $(\alpha, v)$ such that $x \approx \alpha v$. This amounts to solving the optimization problem

$$(\widehat{v}, \widehat{\alpha}) = \underset{\|v\|_2 = 1,\ \alpha}{\operatorname{argmin}} \; \left\|x - \alpha v\right\|^2.$$

The notation “argmin” means the argument that minimizes the function. The equation says that we find the $(\alpha, v)$ that minimizes the distance between $x$ and $\alpha v$. The constraint $\|v\|_2 = 1$ limits the search to the unit sphere; otherwise our solution will not be unique.

Solving the optimization problem is not difficult. If we take the derivative w.r.t. $\alpha$ and set it to zero, we have that

$$2v^T(x - \alpha v) = 0 \quad \Longrightarrow \quad \alpha = v^Tx.$$

Substituting $\alpha = x^Tv$ into the objective function again, we show that

$$\begin{aligned} \underset{\|v\|_2 = 1}{\operatorname{argmin}} \; \|x - \alpha v\|^2 &= \underset{\|v\|_2 = 1}{\operatorname{argmin}} \left\{x^Tx - 2\alpha x^Tv + \alpha^2v^Tv\right\}, && \|v\|_2 = 1 \\ &= \underset{\|v\|_2 = 1}{\operatorname{argmin}} \left\{-2\alpha x^Tv + \alpha^2\right\}, && \text{drop } x^Tx \\ &= \underset{\|v\|_2 = 1}{\operatorname{argmin}} \left\{-2(x^Tv)x^Tv + (x^Tv)^2\right\}, && \text{substitute } \alpha = x^Tv \\ &= \underset{\|v\|_2 = 1}{\operatorname{argmax}} \left\{v^Txx^Tv\right\}, && \text{change min to max.} \end{aligned}$$


Let us pause for a second. We have shown that if we have one data point $x$, the leading principal component $v$ can be determined by maximizing $v^Txx^Tv$. What have we gained? We have transformed the original optimization, which contains two variables $(v, \alpha)$, to a new optimization that contains one variable $v$. Thus if we know how to solve the one-variable problem we are done.

However, there is one more issue we need to address before we discuss how to solve 
for the problem. The issue is that the formulation is about one data sample, not the entire 
dataset. To include all the samples, we need to assume that x is a realization of a random 
vector X. Then the above optimization can be formulated in the expectation sense as 


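$$\begin{aligned} \underset{\|v\|_2 = 1}{\operatorname{argmin}} \; E\|X - \alpha v\|^2 &= \underset{\|v\|_2 = 1}{\operatorname{argmax}} \; v^TE[XX^T]v \\ &= \underset{\|v\|_2 = 1}{\operatorname{argmax}} \; v^T\Sigma v, \end{aligned}$$

where $\Sigma \overset{\text{def}}{=} E[XX^T]$.¹ Therefore, if we can maximize $v^T\Sigma v$ we will be able to determine the principal component.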

Now comes the main result. The following theorem shows that the maximization is 
equivalent to eigendecomposition. The proof requires Lagrange multipliers, which are beyond 
the scope of this book. 


¹Here we assume that $X$ is zero-mean, i.e., $E[X] = 0$. If it is not, then we can subtract the mean by considering $\underset{\|v\|_2 = 1}{\operatorname{argmax}} \; v^TE[(X - \mu)(X - \mu)^T]v$.

Theorem 5.21. Let $\Sigma$ be a $d \times d$ matrix with eigendecomposition $\Sigma = USU^T$. Then the optimization

$$\widehat{v} = \underset{\|v\|_2 = 1}{\operatorname{argmax}} \; v^T\Sigma v \qquad (5.47)$$

has a solution $\widehat{v} = u_1$, i.e., the first column of the eigenvector matrix $U$.


The following proof requires an understanding of Lagrange multipliers and constrained 
optimizations. It is not essential for understanding this chapter. 


We want to prove that the solution to the problem

$$\widehat{v} = \underset{\|v\|_2 = 1}{\operatorname{argmax}} \; v^T\Sigma v$$

is the eigenvector of the matrix $\Sigma$. To show that, we first write down the Lagrangian:

$$L(v, \lambda) = v^T\Sigma v - \lambda(\|v\|^2 - 1).$$

Taking the derivative w.r.t. $v$ and setting to zero yields

$$\nabla_vL(v, \lambda) = 2\Sigma v - 2\lambda v = 0.$$

This is equivalent to $\Sigma v = \lambda v$. So if $\Sigma = USU^T$, then by letting $v = u_i$ and $\lambda = s_i$ we can satisfy the condition since $\Sigma u_i = USU^Tu_i = USe_i = s_iu_i$.


End of the proof. 
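The claim of Theorem 5.21 can be checked by brute force on a small example. In this sketch the $2 \times 2$ matrix, the seed, and the number of random directions are assumptions made for illustration. The quadratic form $v^T\Sigma v$ is evaluated on many random unit vectors and compared with its value at the eigenvector of the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, -1.9], [-1.9, 2.0]])

lam, U = np.linalg.eigh(Sigma)        # ascending eigenvalues
u_top = U[:, -1]                      # eigenvector of the largest eigenvalue

# Evaluate v^T Sigma v on many random unit vectors
V = rng.standard_normal((10000, 2))
V /= np.linalg.norm(V, axis=1, keepdims=True)
values = np.einsum("ni,ij,nj->n", V, Sigma, V)

print(u_top @ Sigma @ u_top, lam[-1])  # both equal the largest eigenvalue
print(values.max())                    # never exceeds the largest eigenvalue
```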


This theorem can be extended to the second (and other) principal components of the covariance matrix. In fact, given the covariance matrix $\Sigma$ we can follow the procedure outlined in Figure 5.23 to determine the principal components. The eigendecomposition of a $d \times d$ matrix $\Sigma$ will give us a $d \times d$ eigenvector matrix $U$ and an eigenvalue matrix $S$. To keep the $p$ leading eigenvectors, we truncate the $U$ matrix to only use the first $p$ eigenvectors. Here, we assume that the eigenvectors are ordered according to the magnitude of the eigenvalues, from large to small.

In practice, if we are given a dataset $\{x^{(1)}, \ldots, x^{(N)}\}$, we can first estimate the covariance matrix $\Sigma$ by

$$\widehat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (x^{(n)} - \widehat{\mu})(x^{(n)} - \widehat{\mu})^T,$$

where $\widehat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x^{(n)}$ is the mean vector. Afterwards, we can compute the eigendecomposition of $\widehat{\Sigma}$ by

$$[U, S] = \text{eig}(\widehat{\Sigma}).$$

On a computer, the principal components are obtained through eigendecomposition. A MATLAB example and a Python example are shown below. We explicitly show the two principal components in this example. The magnitudes of these two vectors are determined by the eigenvalues diag(S).




Figure 5.23: The principal components are the eigenvectors of the covariance matrix. In this figure $\Sigma$ denotes the covariance matrix, $u_1, \ldots, u_p$ denote the $p$ leading eigenvectors (the ones we keep, while the remaining ones are dropped), and $s$ denotes the diagonal of the eigenvalue matrix.


% MATLAB code to perform the principal-component analysis
x = mvnrnd([0,0],[2 -1.9; -1.9 2],1000);
covX = cov(x);
[U,S] = eig(covX);
U(:,1) % Principal component
U(:,2) % Principal component


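# Python code to perform the principal-component analysis
import numpy as np

# Same distribution as in the MATLAB code
x = np.random.multivariate_normal([0,0],[[2,-1.9],[-1.9,2]],1000)
covX = np.cov(x,rowvar=False)
S, U = np.linalg.eig(covX)   # S: eigenvalues (1-D array), columns of U: principal components
print(U)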
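One practical caveat is that np.linalg.eig does not promise any particular ordering of the eigenvalues, whereas the procedure in Figure 5.23 assumes they are sorted from large to small. The sketch below shows one way to enforce that ordering. The use of np.linalg.eigh and the fixed seed are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.multivariate_normal([0, 0], [[2, -1.9], [-1.9, 2]], size=1000)
covX = np.cov(x, rowvar=False)

s, U = np.linalg.eigh(covX)           # ascending order for symmetric input
order = np.argsort(s)[::-1]           # indices from largest to smallest eigenvalue
s, U = s[order], U[:, order]

print(s)          # eigenvalues, largest first
print(U[:, 0])    # leading principal component
```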


Example 5.27. Suppose we have a dataset containing N = 1000 samples, drawn from 
an unknown distribution. The first few samples are 


a, — | 0-5254 ep = [04040 eran = | 2ttes 
Ps 6 G80)o = | god. ees OU a Aee| * 


We can compute the mean and covariance using MATLAB commands mean and cov. This will return us

$$\widehat{\mu} = \begin{bmatrix} 0.0561 \\ -0.0303 \end{bmatrix} \quad \text{and} \quad \widehat{\Sigma} = \begin{bmatrix} 2.0460 & -1.9394 \\ -1.9394 & 2.0426 \end{bmatrix}.$$

Applying eigendecomposition on $\widehat{\Sigma}$, we show that

$$[U, S] = \text{eig}(\widehat{\Sigma}) \quad \Longrightarrow \quad U = \begin{bmatrix} -0.7068 & -0.7074 \\ -0.7074 & 0.7068 \end{bmatrix} \quad \text{and} \quad S = \begin{bmatrix} 0.1049 & 0 \\ 0 & 3.9837 \end{bmatrix}.$$

Therefore, we have obtained two principal components

$$u_1 = \begin{bmatrix} -0.7068 \\ -0.7074 \end{bmatrix} \quad \text{and} \quad u_2 = \begin{bmatrix} -0.7074 \\ 0.7068 \end{bmatrix}.$$

As seen in the figure below, these two principal components make sense. The vector $u_1$ is the orange line and is the minor axis. The vector $u_2$ is the blue line and is the major axis. Again, the ordering of the vectors is determined by the eigenvalues. Since $u_2$ has a larger eigenvalue (= 3.9837), it is the leading principal component.


Figure 5.24: To determine the representation coefficients, we solve an inverse problem by finding the vector $\alpha^{(n)}$ in the equation $x^{(n)} = U_p\alpha^{(n)}$.


Why do we call our method principal component analysis? The analysis part comes from the fact that we can compress a data vector $x^{(n)}$ from a high dimension $d$ to a low dimension $p$. Defining $U_p = [u_1, \ldots, u_p]$, a matrix containing the $p$ leading eigenvectors of the matrix $U$, we solve the inverse problem:

$$x^{(n)} = U_p\alpha^{(n)},$$

where the goal is to determine the coefficient vector $\alpha^{(n)} \in \mathbb{R}^p$. Since $U_p$ is an orthonormal matrix (i.e., $U_p^TU_p = I$), it follows that

$$U_p^Tx^{(n)} = \underbrace{U_p^TU_p}_{=I}\alpha^{(n)},$$

as illustrated in Figure 5.24. Hence,

$$\alpha^{(n)} = U_p^Tx^{(n)}.$$

This equation is a projection operation that projects a data point $x^{(n)}$ onto the space spanned by the $p$ leading principal components. Repeating the procedure for all the data points $x^{(1)}, \ldots, x^{(N)}$ in the dataset, we have compressed the dataset.
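The projection step can be sketched as follows. The synthetic data, the sizes $d = 5$, $p = 2$, $N = 200$, and the noise level are all assumptions chosen so that the data lie close to a $p$-dimensional subspace. Samples are stored as rows, so the formula $\alpha^{(n)} = U_p^Tx^{(n)}$ appears as a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, N = 5, 2, 200
# Synthetic data that lies close to a p-dimensional subspace (an assumption for the demo)
B = np.linalg.qr(rng.standard_normal((d, p)))[0]          # d x p orthonormal basis
X = rng.standard_normal((N, p)) @ B.T * 3 + 0.01 * rng.standard_normal((N, d))
X -= X.mean(axis=0)                                        # zero-mean, as the derivation assumes

s, U = np.linalg.eigh(X.T @ X / N)
Up = U[:, ::-1][:, :p]            # p leading eigenvectors, d x p

alpha = X @ Up                    # row n holds alpha^(n) = Up^T x^(n)
X_rec = alpha @ Up.T              # reconstruction Up alpha^(n)

print(alpha.shape)                                    # (N, p)
print(np.linalg.norm(X - X_rec) / np.linalg.norm(X))  # small relative error
```

The relative reconstruction error indicates how much is lost by keeping only $p$ coefficients per data point.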


Example 5.28. Using the example above, we can show that 


o® =UTe@ = heel a? = Ral (2000) _ eal 


—0.8615} ’ 0.5491] ’ —2.0950| ° 


The principal-component analysis says that since the leading components represent the 
data, we only need to keep the blue-colored values because they are the coefficients 
associated with the leading principal component. 


5.8.2 The eigenface problem 


As a concrete example of PCA, we consider a computer vision problem called the eigenface problem. In 2001, researchers at Yale University published the Yale Database, and a few years later they extended it to a larger one (http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html). The dataset, now known as the Yale Face Dataset, contains 16,128 images of 28 human subjects under nine poses and 64 illumination conditions. The sizes of the images are $d = 168 \times 192 = 32{,}256$ pixels. Treating these $N = 16{,}128$ images as vectors in $\mathbb{R}^{32{,}256 \times 1}$, we have 16,128 of these vectors. Let us call them $\{x^{(1)}, \ldots, x^{(N)}\}$.

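Following the procedure we described above, we estimate the covariance matrix by computing

$$\widehat{\Sigma} = E[(X - \widehat{\mu})(X - \widehat{\mu})^T] \approx \frac{1}{N}\sum_{n=1}^{N} (x^{(n)} - \widehat{\mu})(x^{(n)} - \widehat{\mu})^T, \qquad (5.48)$$

where $\widehat{\mu} = E[X] \approx \frac{1}{N}\sum_{n=1}^{N} x^{(n)}$ is the mean vector. Note that the size of $\widehat{\mu}$ is $32{,}256 \times 1$ and the size of $\widehat{\Sigma}$ is $32{,}256 \times 32{,}256$.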


Figure 5.25: The extended Yale Face Database B. 


Once we obtain an estimate of the covariance matrix, we can perform an eigendecomposition to get

$$[U, S] = \text{eig}(\widehat{\Sigma}).$$

The columns of $U$, i.e., $\{u_i\}_{i=1}^{d}$, are the eigenvectors of $\widehat{\Sigma}$. These eigenvectors are the basis of a testing face image.


Figure 5.26: Given a face image, the learned basis vectors (from the eigendecomposition of the covariance matrix) can be used to compress the image $x$ into a feature vector $\alpha$ where the dimension of $\alpha$ is significantly lower than that of $x$.


With the basis vectors $u_1, \ldots, u_p$ we can project every image in the dataset using a low-dimensional representation. Specifically, for an image $x$ we compute the coefficients

$$\alpha_i = u_i^Tx, \qquad i = 1, \ldots, p,$$

or more compactly $\alpha = U_p^Tx$. Note that the dimension of $x$ is $d \times 1$ (which in our case is $d = 32{,}256$), and the dimension of $\alpha$ can be as small as $p = 100$. Therefore, we are using a 100-dimensional vector to represent a 32,256-dimensional data point. This is a huge dimensionality reduction.

The process repeats for all the samples $x^{(1)}, \ldots, x^{(N)}$. This gives us a collection of representation coefficients $\alpha^{(1)}, \ldots, \alpha^{(N)}$, where each $\alpha^{(n)}$ is 100-dimensional (see Figure 5.26). Notice that the basis vectors $u_i$ look more or less like “face images,” but they are the features of the faces. PCA says that a real face can be written as a linear combination of these basis vectors.


To solve the eigenface problem:

• Compute the covariance matrix of all the images.
• Apply eigendecomposition to the covariance matrix.
• Project onto the basis vectors and find the coefficients.
• The coefficients are the low-dimensional representation of the images.
• We use the coefficients to perform downstream tasks, such as classification.
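The steps above can be sketched end to end on stand-in data. Everything specific in this snippet is an assumption: the images are random arrays rather than faces, the sizes are far smaller than those of the Yale images, and the mean is subtracted before projecting so that the data match the zero-mean assumption used in the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, h, w, p = 300, 8, 8, 10            # hypothetical sizes, far smaller than the Yale images
images = rng.random((N, h, w))        # random stand-in for face images

X = images.reshape(N, h * w)          # each image becomes a d-dimensional vector, d = h*w
mu = X.mean(axis=0)
Sigma_hat = (X - mu).T @ (X - mu) / N # step 1: covariance, d x d

s, U = np.linalg.eigh(Sigma_hat)      # step 2: eigendecomposition (ascending order)
Up = U[:, ::-1][:, :p]                # keep the p leading eigenvectors

alpha = (X - mu) @ Up                 # step 3: coefficients, one p-vector per image
print(X.shape, "->", alpha.shape)     # features for a downstream task such as classification
```

For the actual image size $d = 32{,}256$, the covariance matrix has more than $10^9$ entries, which is roughly 8 GB in double precision, so a direct implementation of this kind is only practical for small $d$.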
5.8.3 What cannot be analyzed by PCA?


PCA is a dimension reduction tool. It compresses a raw data vector $x \in \mathbb{R}^d$ into a smaller feature vector $\alpha \in \mathbb{R}^p$. The advantage is that the downstream learning problems are much easier because $p \ll d$. For example, classification using $\alpha$ is more efficient than classification using $x$ since there is very little information loss from $x$ to $\alpha$.


There are three limitations of PCA: 


• PCA fails when the raw data are not orthogonal. The basis vectors $u_i$ returned by PCA are orthogonal, meaning that $u_i^Tu_j = 0$ as long as $i \neq j$. As a result, if the data intrinsically have this orthogonality property, then PCA will work very well. However, if the data live in a space such as a donut shape as illustrated in Figure 5.27, then PCA will fail. Here, by failure, we mean that $p$ is not much smaller than $d$. To handle datasets behaving like Figure 5.27 we need advanced tools. One of these is the kernel-PCA. The idea is to apply a nonlinear transformation to the data before you run PCA.




Figure 5.27: [Left] PCA works when the data has redundant dimensions or is living on orthogonal 
spaces. [Right] PCA fails when the data does not have easily decomposable spaces. 


• Basis vectors returned by PCA are not interpretable. A temptation with PCA is to think that the basis vectors $u_i$ offer meaningful information because they are the “principal components.” However, since PCA is the eigendecomposition of the covariance matrix, which is purely a mathematical operation, there is no guarantee that the basis vectors contain any semantic meaning. If we look at the basis vectors shown in Figure 5.26, there is almost no information one can draw. Therefore, in the data-science literature alternative methods such as non-negative matrix factorization and the more recent deep neural network embedding are more attractive because the feature vectors sometimes (not always) have meanings.


• PCA does not return you the most influential “component.” Imagine that you are analyzing medical data for research on a disease, in which each data vector $x^{(n)}$ contains height, weight, BMI, blood pressure, etc. When you run PCA on the dataset, you will obtain some “principal components.” However, these principal components will likely have everything, e.g., the height entry of the principal component will have some values, the weight will have some values, etc. If you have found a principal component, it does not mean that you have identified the leading risk factor of the disease. If you want to identify the leading risk factor of the disease, e.g., whether the height or weight is more important, you need to resort to advanced tools such as variable selection or the LASSO type of regression analysis (see Chapter 7).
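The first limitation in the list above can be illustrated with a small experiment on donut-shaped data. The radius, the noise level, the sample size, and the seed below are assumptions made for the demonstration. The point is that neither eigenvalue of the covariance dominates, so dropping either component loses about half of the variation.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 2000)
r = 1.0 + 0.05 * rng.standard_normal(2000)
ring = np.column_stack((r * np.cos(theta), r * np.sin(theta)))  # donut-shaped data

s = np.linalg.eigvalsh(np.cov(ring, rowvar=False))[::-1]        # largest first
print(s / s.sum())   # both fractions are close to 0.5: no dominant component
```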
Closing remark. PCA is a powerful computational tool based on the simple concept of covariance matrices because, as our derivation showed, covariance matrices encode the “variation” of the data. Therefore, by finding a vector that aligns with the maximum variation of the data, we can find the principal component.


5.9 Summary 


As you were reading this chapter, you may have felt that the first and second parts discuss 
distinctly different subjects, and in fact many books treat them as separate topics. We take 
a different approach. We think that they are essentially the same thing if you understand 
the following chain of distributions: 


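$$\underbrace{f_X(x)}_{\text{one variable}} \Longrightarrow \underbrace{f_{X_1,X_2}(x_1, x_2)}_{\text{two variables}} \Longrightarrow \cdots \Longrightarrow \underbrace{f_{X_1,\ldots,X_N}(x_1, \ldots, x_N)}_{N \text{ variables}}.$$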


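The first part exclusively deals with two variables. The generalization from two variables to $N$ variables is straightforward for PDFs and CDFs:

• PDF: $f_{X_1,X_2}(x_1, x_2) \Longrightarrow f_{X_1,\ldots,X_N}(x_1, \ldots, x_N)$.

• CDF: $F_{X_1,X_2}(x_1, x_2) \Longrightarrow F_{X_1,\ldots,X_N}(x_1, \ldots, x_N)$.


The joint expectation can also be generalized from two variables to $N$ variables:

$$\begin{bmatrix} \mathrm{Var}[X_1] & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}[X_2] \end{bmatrix} \Longrightarrow \begin{bmatrix} \mathrm{Var}[X_1] & \cdots & \mathrm{Cov}(X_1, X_N) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_N, X_1) & \cdots & \mathrm{Var}[X_N] \end{bmatrix}.$$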

Conditional PDFs and conditional expectations are powerful tools for decomposing complex events into simpler events. Specifically, the law of total expectation,

$$E[X] = \int E[X \mid Y = y]\,f_Y(y)\,dy = E_Y\left[E_{X|Y}[X \mid Y]\right],$$

is instrumental for evaluating variables defined through conditional relationships. The idea is also extendable to more random variables, such as

$$E[X_1] = \int\!\!\int E[X_1 \mid X_2 = x_2, X_3 = x_3]\,f_{X_2,X_3}(x_2, x_3)\,dx_2\,dx_3,$$

where $E[X_1 \mid X_2 = x_2, X_3 = x_3]$ can be evaluated through

$$E[X_1 \mid X_2 = x_2, X_3 = x_3] = \int x_1\,f_{X_1|X_2,X_3}(x_1 \mid x_2, x_3)\,dx_1.$$

This type of chain relationship can generalize to other high-order cases.
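The law of total expectation can be checked by simulation. The model below is an assumption made up for the illustration: $Y \sim \text{Uniform}[1, 2]$ and $X \mid Y = y \sim \text{Gaussian}(y, 1)$, so that $E[X \mid Y] = Y$ and therefore $E[X] = E[Y] = 1.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200000
Y = rng.uniform(1.0, 2.0, n)          # Y ~ Uniform[1, 2]
X = rng.normal(loc=Y, scale=1.0)      # X | Y = y ~ Gaussian(y, 1), so E[X | Y] = Y

print(X.mean())    # direct Monte Carlo estimate of E[X]
print(Y.mean())    # estimate of E_Y[ E[X | Y] ] = E[Y] = 1.5
```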
It is important to remember that for any high-dimensional random variables, the characterization is always made by the PDF $f_X(x)$ (or the CDF). We did not go into the details of analyzing $f_X(x)$ but have only discussed the mean vector $E[X] = \mu$ and the covariance matrix $\mathrm{Cov}(X) = \Sigma$. We have been focusing exclusively on the high-dimensional Gaussian random variables

$$f_X(x) = \frac{1}{\sqrt{(2\pi)^d|\Sigma|}}\exp\left\{-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right\},$$

because they are ubiquitous in data science today. We discussed the linear transformations from a zero-mean unit-variance Gaussian to another Gaussian, and vice versa.


5.11 Problems


Exercise 1. (VIDEO SOLUTION) 

Alex and Bob each flips a fair coin twice. Use “1” to denote heads and “0” to denote tails. 
Let X be the maximum of the two numbers Alex gets, and let Y be the minimum of the 
two numbers Bob gets. 


(a) Find and sketch the joint PMF $p_{X,Y}(x, y)$.
(b) Find the marginal PMF $p_X(x)$ and $p_Y(y)$.
(c) Find the conditional PMF $p_{X|Y}(x|y)$. Does $p_{X|Y}(x|y) = p_X(x)$? Why or why not?


Exercise 2. 
Two fair dice are rolled. Find the joint PMF of X and Y when 


(a) X is the larger value rolled, and Y is the sum of the two values. 


(b) X is the smaller, and Y is the larger value rolled. 


Exercise 3. 
The amplitudes of two signals $X$ and $Y$ have joint PDF

$$f_{X,Y}(x, y) = e^{-x/2}\,y\,e^{-y^2}$$

for $x > 0$, $y > 0$.

(a) Find the joint CDF.

(b) Find $P(X^{1/2} > Y)$.

(c) Find the marginal PDFs.


Exercise 4. (VIDEO SOLUTION)


Find the marginal CDFs F'x(#) and Fy(y) and determine whether or not X and Y are 
independent, if 


y 


ene. Piste eS 0 
Fxy(2,y)=y1-<“S8", fx > 2,y20, 
0, otherwise. 


Exercise 5. (VIDEO SOLUTION) 
(a) Find the marginal PDF fx(z) if 


fxy (x,y) = veh “ ie 
(b) Find the marginal PDF $f_Y(y)$ if


de-(@-y)?/2 
x, = 
fxy (%,¥) PVin 


Exercise 6. (VIDEO SOLUTION) 
Let X,Y be two random variables with joint CDF 


y+ e~ (yt) 
F — 
x,y (2, Y) y +4 1 
Show that

$$\frac{\partial^2}{\partial x\,\partial y}F_{X,Y}(x, y) = \frac{\partial^2}{\partial y\,\partial x}F_{X,Y}(x, y).$$


What is the implication of this result? 


Exercise 7. (VIDEO SOLUTION) 
Let X and Y be two random variables with joint PDF 


fxy(x,y) = Sg HP), 
, 20 


(a) Find the PDF of Z = max(X,Y). 
(b) Find the PDF of Z = min(X,Y). 


You may leave your answers in terms of the ®(-) function. 


Exercise 8. 
The random vector (X,Y) has a joint PDF 


$$f_{X,Y}(x, y) = 2e^{-x}e^{-2y}$$

for $x > 0$, $y > 0$. Find the probability of the following events:

(a) $\{X + Y < 8\}$.
(b) $\{X - Y < 10\}$.
(c) $\{X^2 < Y\}$.


Exercise 9.


Let X and Y be zero-mean, unit-variance independent Gaussian random variables. Find the 
value of r for which the probability that (X,Y) falls inside a circle of radius r is 1/2. 


Exercise 10. 

The input $X$ to a communication channel is $+1$ or $-1$ with probabilities $p$ and $1 - p$, respectively. The received signal $Y$ is the sum of $X$ and noise $N$, which has a Gaussian distribution with zero mean and variance $\sigma^2 = 0.25$.

(a) Find the joint probability $P(X = j, Y \leq y)$.
(b) Find the marginal PMF of $X$ and the marginal PDF of $Y$.


(c) Suppose we are given that Y > 0. Which is more likely, X = 1 or X = —1? 


Exercise 11. (VIDEO SOLUTION) 


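Let

$$f_{X,Y}(x, y) = \begin{cases} c\,e^{-x}e^{-y}, & \text{if } 0 \leq y \leq x < \infty, \\ 0, & \text{otherwise.} \end{cases}$$

(a) Find $c$.
(b) Find $f_X(x)$ and $f_Y(y)$.
(c) Find $E[X]$ and $E[Y]$, $\mathrm{Var}[X]$ and $\mathrm{Var}[Y]$.
(d) Find $E[XY]$, $\mathrm{Cov}(X, Y)$ and $\rho$.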


Exercise 12. (VIDEO SOLUTION) 
In class, we have used the Cauchy-Schwarz inequality to show that $-1 \leq \rho \leq 1$. This exercise asks you to prove the Cauchy-Schwarz inequality:

$$(E[XY])^2 \leq E[X^2]E[Y^2].$$

Hint: Consider the expectation $E[(tX + Y)^2]$. Note that this is a quadratic equation in $t$ and $E[(tX + Y)^2] \geq 0$ for all $t$. Consider the discriminant of this quadratic equation.


Exercise 13. (VIDEO SOLUTION) 
Let $\Theta \sim \text{Uniform}[0, 2\pi]$.


(a) If $X = \cos\Theta$, $Y = \sin\Theta$. Are $X$ and $Y$ uncorrelated?
(b) If $X = \cos(\Theta/4)$, $Y = \sin(\Theta/4)$. Are $X$ and $Y$ uncorrelated?


Exercise 14. (VIDEO SOLUTION) 
Let X and Y have a joint PDF 


$$f_{X,Y}(x, y) = c(x + y),$$

for $0 \leq x \leq 1$ and $0 \leq y \leq 1$.

(a) Find $c$, $f_X(x)$, $f_Y(y)$, and $E[Y]$.

(b) Find $f_{Y|X}(y|x)$.

(c) Find $P[Y > X \mid X > 1/2]$.

(d) Find $E[Y \mid X = x]$.

(e) Find $E[E[Y|X]]$, and compare with the $E[Y]$ computed in (a).


Exercise 15. (VIDEO SOLUTION) 
Use the law of total expectation to compute the following: 


1. $E[\sin(X + Y)]$, where $X \sim N(0, 1)$, and $Y \mid X \sim \text{Uniform}[x - \pi, x + \pi]$
2. $P[Y < y]$, where $X \sim \text{Uniform}[0, 1]$, and $Y \mid X \sim \text{Exponential}(x)$
3. $E[Xe^Y]$, where $X \sim \text{Uniform}[-1, 1]$, and $Y \mid X \sim N(0, x^2)$


Exercise 16. 

Let $Y = X + N$, where $X$ is the input, $N$ is the noise, and $Y$ is the output of a system. Assume that $X$ and $N$ are independent random variables. It is given that $E[X] = 0$, $\mathrm{Var}[X] = \sigma_X^2$, $E[N] = 0$, and $\mathrm{Var}[N] = \sigma_N^2$.


(a) Find the correlation coefficient $\rho$ between the input $X$ and the output $Y$.


(b) Suppose we estimate the input $X$ by a linear function $g(Y) = aY$. Find the value of $a$ that minimizes the mean squared error $E[(X - aY)^2]$.


(c) Express the resulting mean squared error in terms of $\eta = \sigma_X^2/\sigma_N^2$.


Exercise 17. (VIDEO SOLUTION) 
Two independent random variables X and Y have PDFs 


e*, x > 0, 0, y > 0, 
vw) = — 

fx (x) ‘s ee: fy(y) 

Find the PDF of Z = X —Y. 


Exercise 18. 
Let X and Y be two independent random variables with densities 


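$$f_X(x) = \begin{cases} xe^{-x}, & x \geq 0, \\ 0, & x < 0, \end{cases} \qquad \text{and} \qquad f_Y(y) = \begin{cases} ye^{-y}, & y \geq 0, \\ 0, & y < 0. \end{cases}$$

Find the PDF of $Z = X + Y$.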


Exercise 19. 
The random variables X and Y have the joint PDF 


fxy(z,y) =e Ow 


for $0 < y < x < 1$. Find the PDF of $Z = X + Y$.


Exercise 20. 
The joint density function of X and Y is given by 


$$f_{X,Y}(x, y) = e^{-(x+y)}$$


for $x > 0$, $y > 0$. Find the PDF of the random variable $Z = X/Y$.
Chapter 6


Sample Statistics 


When we think about probability, the first thing that likely comes to mind is flipping a coin, 
throwing a die, or playing a card game. These are excellent examples of the subject. However, 
they seldom fit in the context of modern data science, which is concerned with drawing 
conclusions from data. In our opinion, the power of probability is its ability to summarize 
microstates using macro descriptions. This statement will take us some effort to elaborate. 
We study probability because we want to analyze the uncertainties. However, when we 
have many data points, analyzing the uncertainties of each data point (the microstates) 
is computationally very difficult. Probability is useful here because it allows us to bypass 
the microstates and summarize the macro behavior. Instead of reporting the states of each 
individual, we report their sample average. Instead of offering the worst-case guarantee, 
we offer a probabilistic guarantee. You ask: so what? If we can offer you a performance 
guarantee at 99.99% confidence but one-tenth of the cost of a 100% performance guarantee, 
would you consider our offer? The goal of this chapter is to outline the concepts of these 
probabilistic arguments. 


The significance of sample average 


Imagine that you have a box containing many tiny magnets. (You can also think of a dataset 
containing two classes of labels.) In condensed matter physics, these are known as the spin 
glasses. The orientations of the magnets depend on the magnetic field. Under an extreme 
condition where the magnetic field is strong, all magnets will point in the same direction. 
When the magnetic field is not as strong, some will align with the field but some will not, 
as we show in Figure 6.1. 

If we try to study every single magnet in this box, the correlation of the magnets will 
force us to consider a joint distribution, since if one magnet points to the right it is likely 
that another magnet will also point to the right. The simultaneous description of all magnets 
is modeled through a joint probability distribution 


$$f_{X_1, X_2, \ldots, X_N}(x_1, x_2, \ldots, x_N).$$


Like any joint PDF, this PDF tells us the probability density that the magnets will take 
a collection of states simultaneously. If N is large (say, on the order of millions), this joint 
distribution will be very complicated. 
[Figure 6.1 panels: Experiment 1, ..., Experiment M.]

Figure 6.1: Imagine that we have a box of magnets and we want to measure their orientation angles.
The data points have individual randomness and correlations. Studying each one individually could be
computationally infeasible, as we need to estimate the joint PDF $f_{X_1,\ldots,X_N}(x_1,\ldots,x_N)$ across all the
data points. Probability offers a tool to summarize these individual states using a macro description.
For example, we can analyze the sample average $\overline{X}_N$ of the data points and derive conclusions from
the PDF of $\overline{X}_N$, i.e., $f_{\overline{X}_N}(x)$. The objective of this chapter is to present a few probabilistic tools to
analyze macro descriptions, such as the sample average.


Since the joint PDF is very difficult to obtain computationally, physicists proposed
to study the sample statistics. Instead of looking at the individual states, they look at the
sample average of the states. If we define $X_1, \ldots, X_N$ as the states of the magnets, then
the sample average is

$$\overline{X}_N = \frac{1}{N} \sum_{n=1}^{N} X_n.$$

Since each magnet is random, the sample average is also random, and therefore it has
a PDF:

$$f_{\overline{X}_N}(x).$$

Thus, $\overline{X}_N$ has a PDF, a mean, a variance, and so on.

We call $\overline{X}_N$ a sample statistic. It is called a statistic because it is a summary of the
microstates, and a sample statistic because the statistic is based on random samples, not on
the underlying theoretical distributions. We are interested in knowing the behavior of $\overline{X}_N$
because it is the summary of the observations. If we know the PDF of $\overline{X}_N$, we will know
the mean, the variance, and the value of $\overline{X}_N$ when the magnetic field increases or decreases.


Why study the sample average $\overline{X}_N$?


• Analyzing individual variables is not feasible because the joint PDF can be
extremely high-dimensional.


• The sample average is a macro description of the data.


• If you know the behavior of the sample average, you know most of the data.
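
To see concretely that the sample average is itself a random variable, the following minimal sketch repeats an experiment many times and records one sample average per experiment. It assumes i.i.d. Bernoulli($p$) states rather than the correlated magnets described above, and the names `sample_average`, `N`, `M`, and `p` are illustrative choices of ours.

```python
import random

def sample_average(num_samples, p, rng):
    """Average of num_samples i.i.d. Bernoulli(p) draws (an assumed toy model)."""
    return sum(1 if rng.random() < p else 0 for _ in range(num_samples)) / num_samples

rng = random.Random(0)      # fixed seed so that the run is repeatable
N, M, p = 100, 5000, 0.3    # samples per experiment, number of experiments, parameter

# Each experiment yields one realization of the sample average.
averages = [sample_average(N, p, rng) for _ in range(M)]

# Empirical mean and (unbiased) variance of the M sample averages.
mean_of_averages = sum(averages) / M
var_of_averages = sum((a - mean_of_averages) ** 2 for a in averages) / (M - 1)

print(mean_of_averages, var_of_averages)   # compare with p and p * (1 - p) / N
```

Under this i.i.d. assumption the sample average has mean $p$ and variance $p(1-p)/N$, so the two printed numbers should be close to 0.3 and 0.0021.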


Probabilistic guarantee versus worst-case guarantee 


Besides the sample average, we are also interested in the difference between a probabilistic 
guarantee and a deterministic guarantee. 
Consider the birthday paradox (see Chapter 1 for details). Suppose there are 50 students
in a room. What is the probability that at least two students have the same birthday?
A naive thought would suggest that we need 366 students to guarantee a pair with the same
birthday because there are 365 days. So, with only 50 students, it would seem unlikely to
have a pair with the same birthday. However, it turns out that with just 50 students, the
probability of having at least one pair with the same birthday is more than 97%. Figure 6.2
below shows a calculation by a computer, where we plot the estimated probability as a function
of the number of students. What is more surprising is that with as few as 23 students,
the probability is greater than 50%. There is no need for 366 students in order
to offer a probabilistic guarantee.
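
The numbers quoted above can be reproduced with a few lines of code. The sketch below computes the exact probability under the usual assumptions of 365 equally likely birthdays and independent students; the function name `birthday_collision_probability` is ours.

```python
def birthday_collision_probability(num_people, num_days=365):
    """Exact probability that at least two of num_people share a birthday."""
    if num_people > num_days:
        return 1.0  # pigeonhole principle: a shared birthday is certain
    prob_all_distinct = 1.0
    for k in range(num_people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        prob_all_distinct *= (num_days - k) / num_days
    return 1.0 - prob_all_distinct

for n in (23, 50, 366):
    print(n, birthday_collision_probability(n))
```

Only the last case (366 people) is a deterministic guarantee; the first two are probabilistic guarantees obtained with far fewer people.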


[Figure 6.2: probability of at least one shared birthday plotted against the number of people (horizontal axis: Number of people, 0 to 100).]

Figure 6.2: The birthday paradox asks the question of how many people we need to ask in order to have
at least two of them having the same birthday. While we tend to think that the answer is 366 (because
there are 365 days), the actual probability, as we have calculated (see Chapter 1), is more than 97%,
even if we have only asked 50 people. The curve above shows the probability of having at least one pair
of people having the same birthday as a function of the number of people. The plot highlights the gap
between the worst-case performance and an average-case performance.


Why does this happen? Certainly, we can trace back to the formulae in Chapter 1 and 
argue through the lens of combinations and permutations. However, the more important 
message is about the difference between the worst-case guarantee and the average-case 
guarantee. 


Worst case versus average case 


• Worst-case guarantee: You need to ensure that the worst one is protected. This
requires an exhaustive search until hitting 100%. It is a deterministic guarantee.


• Average-case guarantee: You guarantee that with a high probability (e.g., 99.99%),
the undesirable event does not happen. This is a probabilistic guarantee.


Is there a difference between 99.99% and 100%? If the probability is 99.99%, there is 
one failure every 10,000 trials on average. You are unlikely to fail, but it is still possible. 
A 100% guarantee says that no matter how many trials you make you will not fail. The 
99.99% guarantee is much weaker (yes, much weaker, not just a little bit weaker) than the 
deterministic guarantee. However, in practice, people might be willing to pay for the risk in 
exchange for efficiency. This is the principle behind insurance. Automobile manufacturing 
also uses this principle: your chance of purchasing a defective car is non-zero, but if the
manufacturer can sell enough cars to compensate for the maintenance cost of fixing your
car, they might be willing to offer a limited warranty in exchange for a lower selling price.

How do we analyze the probabilistic guarantee, e.g., for the sample average? Remember
that the sample average $\overline{X}_N$ is a random variable. Since it is a random variable, it has a
mean, variance, and PDF.¹ To measure the probabilistic guarantee, we consider the event

$$\mathcal{B} \overset{\text{def}}{=} \left\{ |\overline{X}_N - \mu| \ge \epsilon \right\},$$

where $\mu = \mathbb{E}[\overline{X}_N]$ is the true population mean, and $\epsilon > 0$ is a very small number. This
probability is illustrated in Figure 6.3, assuming that $\overline{X}_N$ has the PDF of a Gaussian. The
probability of $\mathcal{B}$ is the area of the two tails under the PDF. Therefore, $\mathcal{B}$ is a bad event because in
principle $\overline{X}_N$ should be close to $\mu$. The probability $\mathbb{P}[\mathcal{B}]$ measures situations where $\overline{X}_N$
stays very far from $\mu$. If we can show that $\mathbb{P}[\mathcal{B}]$ is small (e.g., < 0.01%), then we can say
that we have obtained a probabilistic guarantee at 99.99%.
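
When the sample average is assumed to be Gaussian, as in Figure 6.3, the probability of the bad event has a closed form: $\mathbb{P}[|\overline{X}_N - \mu| \ge \epsilon] = \text{erfc}\big(\epsilon / (\sigma\sqrt{2})\big)$, where $\sigma$ is the standard deviation of $\overline{X}_N$. The sketch below evaluates this expression; the Gaussian assumption and the function name `gaussian_two_sided_tail` are ours.

```python
import math

def gaussian_two_sided_tail(epsilon, sigma=1.0):
    """P[|X - mu| >= epsilon] for X ~ Gaussian(mu, sigma^2).

    Uses 2 * (1 - Phi(epsilon / sigma)) = erfc(epsilon / (sigma * sqrt(2))).
    """
    return math.erfc(epsilon / (sigma * math.sqrt(2.0)))

# Tail probability for epsilon = 1 and a unit-variance Gaussian (the setting of Figure 6.3).
print(gaussian_two_sided_tail(1.0))

# How far out must epsilon be before the bad event has probability below 0.01%?
for eps in (3.0, 3.8, 3.9):
    print(eps, gaussian_two_sided_tail(eps) < 1e-4)
```

With $\epsilon = 1$ the tails carry about 32% of the probability, which is far from a 99.99% guarantee; under this Gaussian assumption, $\epsilon$ must be roughly 3.9 standard deviations before the tail probability drops below 0.01%.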


[Figure 6.3: Gaussian-shaped PDF of $\overline{X}_N$ with the two tails $\mathbb{P}(|\overline{X}_N - \mu| > \epsilon)$ marked.]

Figure 6.3: The probabilistic guarantee of a sample average $\overline{X}_N$ is established by computing the
probability of the tails. In this example, we assume that $f_{\overline{X}_N}(x)$ takes a Gaussian shape, and we define
$\epsilon = 1$. Anything belonging to $|\overline{X}_N - \mu| > \epsilon$ is called an undesired event $\mathcal{B}$. If the probability of an
undesired event is small, we say that we can offer a probabilistic guarantee.


The moment we compute $\mathbb{P}[|\overline{X}_N - \mu| > \epsilon]$, we enter the race of probabilistic guarantee
(e.g., 99.99%). Why? If the probability $\mathbb{P}[|\overline{X}_N - \mu| > \epsilon]$ is less than 0.01%, it still does not
exclude the possibility that something bad will happen once every 10,000 trials on average.
The chance is low, but it is still possible. We will learn some mathematical tools for analyzing
this type of probabilistic guarantee.


Plan for this chapter 


With these two main themes in mind, we now discuss the organization of this chapter. There 
are four sections: two for mathematical tools and two for main results. 


• Moment-generating functions: We have seen in Chapter 5 that the PDF of a sum of
two random variables $X + Y$ is the convolution of the two PDFs $f_X * f_Y$. Convolutions
are non-trivial, especially when we have more random variables to sum. The moment-generating
functions provide a convenient way of summing $N$ random variables. They
are transform-domain techniques (e.g., Fourier transforms). Since convolutions in
time are multiplications in frequency, the moment-generating functions allow us to
multiply PDFs in the transformed space. In this way, we can sum as many random
variables as we want. We will discuss this idea in Section 6.1.


¹Not all random variables have mean and variance, e.g., a Cauchy random variable, but most of them
do.


Key Concept 1: Why study moment-generating functions? 


Moment-generating functions help us determine the PDF of $X_1 + X_2 + \cdots + X_N$.


• Probability inequalities: When analyzing sample statistics such as $\overline{X}_N$, evaluating the
exact probability could be difficult because it requires integrating the PDFs. However,
if our ultimate goal is to estimate the probability, deriving an upper bound might be
sufficient to achieve the goal. The probability inequalities are designed for this purpose.
In Section 6.2, we discuss several of the most basic probability inequalities. We will
use some of them to prove the law of large numbers.


Key Concept 2: How can probability inequalities be useful? 


Probability inequalities help us upper-bound the probability of the bad event, $\mathbb{P}[|\overline{X}_N - \mu| > \epsilon]$.


• Law of large numbers: This is the first main result of the chapter. The law of large
numbers says that the sample average $\overline{X}_N$ converges to the population mean $\mu$ when
the number of samples grows to infinity. The law of large numbers comes in two
versions: the weak law of large numbers and the strong law of large numbers. The
difference is the type of convergence they guarantee. The weak law is based on convergence
in probability, whereas the strong law is based on almost sure convergence.
We will discuss these types of convergence in Section 6.3.


Key Concept 3: What is the law of large numbers? 


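There is a weak law and a strong law of large numbers. The weak law of large
numbers says that $\overline{X}_N$ converges to the true mean $\mu$ as $N$ grows: for any $\epsilon > 0$,

$$\lim_{N \to \infty} \mathbb{P}\left[ |\overline{X}_N - \mu| > \epsilon \right] = 0.$$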


• Central Limit Theorem: The Central Limit Theorem says that the probability of
$\overline{X}_N$ can be approximated by the probability of a Gaussian. You can also think of
this as saying that the PDF of $\overline{X}_N$ is converging to a distribution that can be well
approximated by a bell-shaped Gaussian. If we have many random variables and their
sum is becoming a Gaussian, we can ignore the individual PDFs and focus on the
Gaussian. This explains why the Gaussian is so popular. We will discuss this theorem
in detail in Section 6.4.


Key Concept 4: What is the Central Limit Theorem? 


The CDF of $\overline{X}_N$ can be approximated by the CDF of a Gaussian, as $N$ grows.


6.1 Moment-Generating and Characteristic Functions


Consider two independent random variables $X$ and $Y$ with PDFs $f_X(x)$ and $f_Y(y)$, respectively.
Let $Z = X + Y$ be the sum of the two random variables. We know from Chapter 5
that the PDF of $Z$, $f_Z$, is the convolution of $f_X$ and $f_Y$. However, we think you will agree
that convolutions are not easy to compute. Especially when the sum involves more random 
variables, computing the convolution would be tedious. So how should we proceed in this 
case? One approach is to use some kind of “frequency domain” method that transforms 
the PDFs to another domain and then perform multiplication instead of the convolution 
to make the calculations easy or at least easier. The moment-generating functions and the 
characteristic functions are designed for this purpose. 


6.1.1 Moment-generating function 


Definition 6.1. For any random variable $X$, the moment-generating function (MGF)
$M_X(s)$ is

$$M_X(s) = \mathbb{E}\left[e^{sX}\right]. \tag{6.1}$$

The definition says that the moment-generating function (MGF) is the expectation of
$e^{sX}$, the exponential of the random variable scaled by some $s$. Effectively, it is the expectation of a
function of a random variable. The meaning of the expectation can be seen by writing out
the definition. For the discrete case, the MGF is

$$M_X(s) = \sum_{x \in \Omega} e^{sx} p_X(x), \tag{6.2}$$

whereas in the continuous case, the MGF is

$$M_X(s) = \int_{-\infty}^{\infty} e^{sx} f_X(x)\, dx. \tag{6.3}$$


The continuous case should remind us of the definition of a Laplace transform. For any
function $f(t)$, the Laplace transform (written here with the kernel $e^{st}$, which matches the MGF) is

$$\mathcal{L}[f](s) = \int_{-\infty}^{\infty} f(t) e^{st}\, dt.$$

From this perspective, we can interpret the MGF as the Laplace transform of the PDF.
The argument $s$ of the output can be regarded as the coordinate in the Laplace space. If
$s = -j\omega$, then $M_X(-j\omega)$ becomes the Fourier transform of the PDF.


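Example 6.1. Consider a random variable $X$ with three states 0, 1, 2 and with probability
masses $\frac{2}{6}$, $\frac{3}{6}$, $\frac{1}{6}$ respectively. Find the MGF.

Solution. The moment-generating function is

$$M_X(s) = \mathbb{E}[e^{sX}] = e^{s \cdot 0} \cdot \frac{2}{6} + e^{s \cdot 1} \cdot \frac{3}{6} + e^{s \cdot 2} \cdot \frac{1}{6} = \frac{1}{3} + \frac{e^{s}}{2} + \frac{e^{2s}}{6}.$$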

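Practice Exercise 6.1. Find the MGF for a Poisson random variable.


Solution. The MGF of a Poisson random variable can be found as

$$\mathbb{E}[e^{sX}] = \sum_{x=0}^{\infty} e^{sx} \frac{\lambda^x e^{-\lambda}}{x!} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{(\lambda e^{s})^x}{x!} = e^{-\lambda} e^{\lambda e^{s}} = e^{\lambda(e^{s} - 1)}.$$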


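Practice Exercise 6.2. Find the MGF for an exponential random variable.


Solution. The MGF of an exponential random variable can be found as

$$\mathbb{E}[e^{sX}] = \int_0^{\infty} e^{sx} \lambda e^{-\lambda x}\, dx = \int_0^{\infty} \lambda e^{(s - \lambda)x}\, dx = \frac{\lambda}{\lambda - s}, \quad \text{if } \lambda > s.$$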
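
As a sanity check on the closed form $\lambda/(\lambda - s)$, we can evaluate the defining integral numerically. The sketch below uses a simple midpoint rule on a truncated interval; the truncation point, the step count, and the function name `exponential_mgf_numeric` are assumptions made for illustration, and the approximation is only meaningful for $s < \lambda$.

```python
import math

def exponential_mgf_numeric(s, lam, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of E[e^{sX}] for X ~ Exponential(lam), s < lam.

    The integral over [0, infinity) is truncated at `upper`, which is harmless
    when (lam - s) * upper is large.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h                      # midpoint of the i-th subinterval
        total += math.exp(s * x) * lam * math.exp(-lam * x)
    return total * h

lam, s = 2.0, 0.5
print(exponential_mgf_numeric(s, lam), lam / (lam - s))
```

The two printed values should agree to several decimal places. For $s \ge \lambda$ the integrand no longer decays, which is the numerical face of the condition $\lambda > s$.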


Why are moment-generating functions so called? The following theorem reveals the 
reason. 


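Theorem 6.1. The MGF has the properties that

• $M_X(0) = 1$,

• $\frac{d}{ds} M_X(s)\big|_{s=0} = \mathbb{E}[X]$ and $\frac{d^2}{ds^2} M_X(s)\big|_{s=0} = \mathbb{E}[X^2]$,

• $\frac{d^k}{ds^k} M_X(s)\big|_{s=0} = \mathbb{E}[X^k]$, for any positive integer $k$.


Proof. The first property holds because

$$M_X(0) = \mathbb{E}[e^{0 \cdot X}] = \mathbb{E}[1] = 1.$$

The third property holds because

$$\frac{d^k}{ds^k} M_X(s) = \int_{-\infty}^{\infty} \frac{d^k}{ds^k} e^{sx} f_X(x)\, dx = \int_{-\infty}^{\infty} x^k e^{sx} f_X(x)\, dx.$$

Setting $s = 0$ yields

$$\frac{d^k}{ds^k} M_X(s)\Big|_{s=0} = \int_{-\infty}^{\infty} x^k f_X(x)\, dx = \mathbb{E}[X^k].$$

The second property is a special case of the third property.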
The theorem tells us that if we take the derivative of the MGF and set $s = 0$, we will
obtain the moment. The order of the moment depends on the order of the derivative. As a
result, the MGF can “generate moments” by taking derivatives. This happens because of
the exponential function $e^{sx}$. Since $\frac{d}{ds} e^{sx} = x e^{sx}$, the variable $x$ appears whenever we take
the derivative.


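Practice Exercise 6.3. Let $X$ be a Bernoulli random variable with parameter $p$.
Find the first two moments using the MGF.


Solution. The MGF of a Bernoulli random variable is

$$\begin{aligned}
M_X(s) &= \mathbb{E}[e^{sX}] \\
&= e^{s \cdot 0} p_X(0) + e^{s \cdot 1} p_X(1) \\
&= (1)(1 - p) + (e^{s})(p) \\
&= 1 - p + p e^{s}.
\end{aligned}$$

The first and the second moment, using the derivative approach, are

$$\begin{aligned}
\mathbb{E}[X] &= \frac{d}{ds} M_X(s)\Big|_{s=0} = \frac{d}{ds}\left(1 - p + p e^{s}\right)\Big|_{s=0} = p e^{s}\Big|_{s=0} = p, \\
\mathbb{E}[X^2] &= \frac{d^2}{ds^2} M_X(s)\Big|_{s=0} = \frac{d^2}{ds^2}\left(1 - p + p e^{s}\right)\Big|_{s=0} = p e^{s}\Big|_{s=0} = p.
\end{aligned}$$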
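
The derivative approach can also be checked numerically. The sketch below approximates the first and second derivatives of the Bernoulli MGF $1 - p + pe^{s}$ at $s = 0$ by central finite differences; the step size `h` and the helper names are our own choices, and finite differences are only an approximation of the exact derivatives.

```python
import math

def bernoulli_mgf(s, p):
    """MGF of a Bernoulli(p) random variable: 1 - p + p * e^s."""
    return 1.0 - p + p * math.exp(s)

def first_two_moments_from_mgf(mgf, h=1e-3):
    """Approximate M'(0) and M''(0) with central finite differences."""
    first = (mgf(h) - mgf(-h)) / (2.0 * h)
    second = (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / (h * h)
    return first, second

p = 0.3
m1, m2 = first_two_moments_from_mgf(lambda s: bernoulli_mgf(s, p))
print(m1, m2)   # both should be close to p
```

Since a Bernoulli random variable only takes the values 0 and 1, $X^2 = X$, which is why both moments equal $p$.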


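Distribution    PMF / PDF                                                                                   E[X]              Var[X]                  M_X(s)
Bernoulli       $p_X(1) = p$ and $p_X(0) = 1 - p$                                                           $p$               $p(1-p)$                $1 - p + pe^{s}$
Binomial        $p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}$                                                     $np$              $np(1-p)$               $(1 - p + pe^{s})^n$
Geometric       $p_X(k) = p(1-p)^{k-1}$                                                                     $\frac{1}{p}$     $\frac{1-p}{p^2}$       $\frac{pe^{s}}{1 - (1-p)e^{s}}$
Poisson         $p_X(k) = \frac{\lambda^k e^{-\lambda}}{k!}$                                                $\lambda$         $\lambda$               $e^{\lambda(e^{s} - 1)}$
Gaussian        $f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$    $\mu$             $\sigma^2$              $\exp\left\{\mu s + \frac{\sigma^2 s^2}{2}\right\}$
Exponential     $f_X(x) = \lambda \exp\{-\lambda x\}$                                                       $\frac{1}{\lambda}$   $\frac{1}{\lambda^2}$   $\frac{\lambda}{\lambda - s}$
Uniform         $f_X(x) = \frac{1}{b-a}$                                                                    $\frac{a+b}{2}$   $\frac{(b-a)^2}{12}$    $\frac{e^{sb} - e^{sa}}{s(b-a)}$


Table 6.1: Moment-generating functions of common random variables.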
6.1.2 Sum of independent variables via MGF


MGFs are most useful when analyzing the PDF of a sum of two random variables. The 
following theorem highlights the result. 


Theorem 6.2. Let $X$ and $Y$ be independent random variables. Let $Z = X + Y$. Then

$$M_Z(s) = M_X(s) M_Y(s). \tag{6.4}$$


Proof. By the definition of MGF, we have that

$$M_Z(s) = \mathbb{E}\left[e^{s(X+Y)}\right] \overset{(a)}{=} \mathbb{E}\left[e^{sX}\right] \mathbb{E}\left[e^{sY}\right] = M_X(s) M_Y(s),$$

where (a) is valid because $X$ and $Y$ are independent.
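
For discrete random variables the product rule $M_Z(s) = M_X(s) M_Y(s)$ can be checked directly: compute the PMF of $Z = X + Y$ by convolution, then compare its MGF with the product of the two individual MGFs. The PMFs below are arbitrary illustrative choices, and the helper names `mgf` and `pmf_of_sum` are ours.

```python
import math
from collections import defaultdict

def mgf(pmf, s):
    """MGF of a discrete random variable given as {value: probability}."""
    return sum(prob * math.exp(s * value) for value, prob in pmf.items())

def pmf_of_sum(pmf_x, pmf_y):
    """PMF of X + Y for independent X and Y (discrete convolution)."""
    out = defaultdict(float)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] += px * py
    return dict(out)

pmf_x = {0: 0.2, 1: 0.5, 2: 0.3}
pmf_y = {1: 0.6, 3: 0.4}
pmf_z = pmf_of_sum(pmf_x, pmf_y)

for s in (-1.0, 0.5, 2.0):
    print(s, mgf(pmf_z, s), mgf(pmf_x, s) * mgf(pmf_y, s))
```

The convolution costs a double loop over the two supports, whereas the product of MGFs is a single multiplication at each $s$; this is the convenience the theorem buys.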


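Corollary 6.1. Consider independent random variables $X_1, \ldots, X_N$. Let $Z = \sum_{n=1}^{N} X_n$
be the sum of random variables. Then the MGF of $Z$ is

$$M_Z(s) = \prod_{n=1}^{N} M_{X_n}(s). \tag{6.5}$$

If these random variables are further assumed to be identically distributed, the MGF is

$$M_Z(s) = \left(M_{X_1}(s)\right)^N. \tag{6.6}$$


Proof. This follows immediately from the previous theorem:

$$M_Z(s) = \mathbb{E}\left[e^{s(X_1 + \cdots + X_N)}\right] = \mathbb{E}[e^{sX_1}]\, \mathbb{E}[e^{sX_2}] \cdots \mathbb{E}[e^{sX_N}] = \prod_{n=1}^{N} M_{X_n}(s).$$

If the random variables $X_1, \ldots, X_N$ are i.i.d., then the product simplifies to

$$\prod_{n=1}^{N} M_{X_n}(s) = \prod_{n=1}^{N} M_{X_1}(s) = \left(M_{X_1}(s)\right)^N.$$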


Theorem 6.3 (Sum of Bernoulli = binomial). Let $X_1, \ldots, X_N$ be a sequence of
i.i.d. Bernoulli random variables with parameter $p$. Let $Z = X_1 + \cdots + X_N$ be the sum.
Then $Z$ is a binomial random variable with parameters $(N, p)$.


Proof. Let us consider a sequence of i.i.d. Bernoulli random variables $X_n \sim \text{Bernoulli}(p)$
for $n = 1, \ldots, N$. Let $Z = X_1 + \cdots + X_N$. The moment-generating function of $Z$ is

$$M_Z(s) = \mathbb{E}\left[e^{s(X_1 + \cdots + X_N)}\right] = \prod_{n=1}^{N} \mathbb{E}[e^{sX_n}] = \prod_{n=1}^{N} \left(p e^{s} + (1 - p)\right) = \left(p e^{s} + (1 - p)\right)^N.$$

Now, let us check the moment-generating function of a binomial random variable: If $Z \sim \text{Binomial}(N, p)$, then

$$M_Z(s) = \mathbb{E}[e^{sZ}] = \sum_{k=0}^{N} e^{sk} \binom{N}{k} p^k (1 - p)^{N-k} = \sum_{k=0}^{N} \binom{N}{k} (p e^{s})^k (1 - p)^{N-k} = \left(p e^{s} + (1 - p)\right)^N,$$

where the last equality holds because $\sum_{k=0}^{N} \binom{N}{k} a^k b^{N-k} = (a + b)^N$. Therefore, the two
moment-generating functions are identical.
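
The same conclusion can be reached without MGFs by brute force: convolve the Bernoulli PMF with itself $N$ times and compare the result with the binomial PMF. The sketch below does exactly that; it is a numerical check, not a proof, and the function names are ours.

```python
import math

def convolve(p, q):
    """Convolution of two PMFs stored as lists indexed by value 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def pmf_of_bernoulli_sum(N, p):
    """PMF of X_1 + ... + X_N for i.i.d. Bernoulli(p), by repeated convolution."""
    pmf = [1.0]                          # PMF of the constant 0 (empty sum)
    for _ in range(N):
        pmf = convolve(pmf, [1.0 - p, p])
    return pmf

def binomial_pmf(N, p):
    """PMF of Binomial(N, p) from the closed-form expression."""
    return [math.comb(N, k) * p**k * (1.0 - p) ** (N - k) for k in range(N + 1)]

N, p = 10, 0.3
by_convolution = pmf_of_bernoulli_sum(N, p)
by_formula = binomial_pmf(N, p)
print(max(abs(a - b) for a, b in zip(by_convolution, by_formula)))
```

The printed maximum difference should be at the level of floating-point rounding error.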


Theorem 6.4 (Sum of binomial = binomial). Let $X_1, \ldots, X_N$ be a sequence of
i.i.d. binomial random variables with parameters $(n, p)$. Let $Z = X_1 + \cdots + X_N$ be the
sum. Then $Z$ is a binomial random variable with parameters $(Nn, p)$.


Proof. The MGF of a binomial random variable is

$$M_{X_i}(s) = \left(p e^{s} + (1 - p)\right)^n.$$

If we have $N$ of these random variables, then $Z = X_1 + \cdots + X_N$ will have the MGF

$$M_Z(s) = \prod_{i=1}^{N} M_{X_i}(s) = \left(p e^{s} + (1 - p)\right)^{Nn}.$$

Note that this is just the MGF of another binomial random variable with parameters $(Nn, p)$.


Theorem 6.5 (Sum of Poisson = Poisson). Let $X_1, \ldots, X_N$ be a sequence of
i.i.d. Poisson random variables with parameter $\lambda$. Let $Z = X_1 + \cdots + X_N$ be the sum.
Then $Z$ is a Poisson random variable with parameter $N\lambda$.


Proof. The MGF of a Poisson random variable is

$$M_X(s) = \mathbb{E}[e^{sX}] = \sum_{k=0}^{\infty} e^{sk} \frac{\lambda^k}{k!} e^{-\lambda} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda e^{s})^k}{k!} = e^{-\lambda} e^{\lambda e^{s}} = e^{\lambda(e^{s} - 1)}.$$

Assume that we have a sum of $N$ i.i.d. Poisson random variables. Then, by Corollary 6.1,
we have that

$$M_Z(s) = [M_X(s)]^N = e^{N\lambda(e^{s} - 1)}.$$

Therefore, the resulting random variable $Z$ is a Poisson with parameter $N\lambda$.
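Theorem 6.6 (Sum of Gaussian = Gaussian). Let $X_1, \ldots, X_N$ be a sequence of
independent Gaussian random variables with parameters $(\mu_1, \sigma_1^2), \ldots, (\mu_N, \sigma_N^2)$. Let
$Z = X_1 + \cdots + X_N$ be the sum. Then $Z$ is a Gaussian random variable:

$$Z \sim \text{Gaussian}\left( \sum_{n=1}^{N} \mu_n, \; \sum_{n=1}^{N} \sigma_n^2 \right). \tag{6.7}$$


Proof. We skip the proof of the MGF of a Gaussian. It can be shown that

$$M_X(s) = \exp\left\{ \mu s + \frac{\sigma^2 s^2}{2} \right\}.$$

When we have a sequence of Gaussian random variables, then

$$\begin{aligned}
M_Z(s) &= \mathbb{E}\left[e^{s(X_1 + \cdots + X_N)}\right] \\
&= M_{X_1}(s) \cdots M_{X_N}(s) \\
&= \exp\left\{ \mu_1 s + \frac{\sigma_1^2 s^2}{2} \right\} \cdots \exp\left\{ \mu_N s + \frac{\sigma_N^2 s^2}{2} \right\} \\
&= \exp\left\{ \left( \sum_{n=1}^{N} \mu_n \right) s + \left( \sum_{n=1}^{N} \sigma_n^2 \right) \frac{s^2}{2} \right\}.
\end{aligned}$$

Therefore, the resulting random variable $Z$ is also a Gaussian. The mean and variance of $Z$
are $\sum_{n=1}^{N} \mu_n$ and $\sum_{n=1}^{N} \sigma_n^2$, respectively.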
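
A quick simulation can make the statement about the mean and variance tangible. The sketch below draws independent Gaussians with assumed parameters, sums them, and compares the empirical mean and variance of the sum with $\sum_n \mu_n$ and $\sum_n \sigma_n^2$. The parameter values, the seed, and the function name are illustrative; note that `random.gauss` takes the standard deviation, not the variance.

```python
import random

def simulate_gaussian_sum(params, num_trials, seed=0):
    """Draw num_trials realizations of Z = X_1 + ... + X_N.

    params is a list of (mean, standard deviation) pairs, one per independent Gaussian.
    """
    rng = random.Random(seed)
    return [sum(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(num_trials)]

params = [(1.0, 2.0), (-3.0, 1.0), (0.5, 0.5)]       # (mu_n, sigma_n), assumed values
samples = simulate_gaussian_sum(params, num_trials=100_000)

empirical_mean = sum(samples) / len(samples)
empirical_var = sum((z - empirical_mean) ** 2 for z in samples) / (len(samples) - 1)

theoretical_mean = sum(mu for mu, _ in params)            # sum of the means
theoretical_var = sum(sigma**2 for _, sigma in params)    # sum of the variances
print(empirical_mean, theoretical_mean)
print(empirical_var, theoretical_var)
```

The simulation only checks the first two moments; that the sum is exactly Gaussian is what the MGF argument above establishes.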


6.1.3 Characteristic functions 


Moment-generating functions are the Laplace transforms of the PDFs. However, since the 
Laplace transform is defined on the entire right half-plane, not all PDFs can be transformed. 
One way to mitigate this problem is to restrict s to the imaginary axis, s = jw. This will 
give us the characteristic function. 


Definition 6.2 (Usual definition). The characteristic function of a random variable
$X$ is

$$\Phi_X(j\omega) = \mathbb{E}[e^{j\omega X}]. \tag{6.8}$$

However, we note that since $\omega$ can take any value in $(-\infty, \infty)$, it does not matter if we
consider $\mathbb{E}[e^{-j\omega X}]$ or $\mathbb{E}[e^{j\omega X}]$. This leads to the following equivalent definition of the
characteristic function:


Definition 6.3 (Alternative definition (for this book)). The characteristic function
of a random variable $X$ is

$$\Phi_X(j\omega) = \mathbb{E}[e^{-j\omega X}]. \tag{6.9}$$

If we follow this definition, we see that the characteristic function can be written as

$$\Phi_X(j\omega) = \mathbb{E}[e^{-j\omega X}] = \int_{-\infty}^{\infty} e^{-j\omega x} f_X(x)\, dx. \tag{6.10}$$

This is exactly the Fourier transform of the PDF. The reason for introducing this alternative
characteristic function is that $\mathbb{E}[e^{-j\omega X}]$ is the Fourier transform of $f_X(x)$ but $\mathbb{E}[e^{j\omega X}]$ is the
inverse Fourier transform of $f_X(x)$. The former is more convenient (in terms of notation)
for students who have taken a course in signals and systems. However, we should stress that
the usual way of defining the characteristic function is $\mathbb{E}[e^{j\omega X}]$.

A list of common Fourier transforms is shown in the table below. Additional identities 
can be found in standard signals and systems textbooks. 


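Fourier Transforms

$f(t) \longleftrightarrow F(\omega)$

1. $e^{-at}u(t) \longleftrightarrow \frac{1}{a + j\omega}$, $a > 0$
2. $e^{at}u(-t) \longleftrightarrow \frac{1}{a - j\omega}$, $a > 0$
3. $e^{-a|t|} \longleftrightarrow \frac{2a}{a^2 + \omega^2}$, $a > 0$
4. $\frac{a^2}{a^2 + t^2} \longleftrightarrow \pi a e^{-a|\omega|}$, $a > 0$
5. $t e^{-at}u(t) \longleftrightarrow \frac{1}{(a + j\omega)^2}$, $a > 0$
6. $t^n e^{-at}u(t) \longleftrightarrow \frac{n!}{(a + j\omega)^{n+1}}$, $a > 0$
7. $\text{rect}\left(\frac{t}{\tau}\right) \longleftrightarrow \tau\, \text{sinc}\left(\frac{\omega\tau}{2}\right)$
8. $\text{sinc}(Wt) \longleftrightarrow \frac{\pi}{W}\, \text{rect}\left(\frac{\omega}{2W}\right)$
9. $\Delta\left(\frac{t}{\tau}\right) \longleftrightarrow \frac{\tau}{2}\, \text{sinc}^2\left(\frac{\omega\tau}{4}\right)$
10. $\text{sinc}^2\left(\frac{Wt}{2}\right) \longleftrightarrow \frac{2\pi}{W}\, \Delta\left(\frac{\omega}{2W}\right)$
11. $e^{-at}\sin(\omega_0 t)u(t) \longleftrightarrow \frac{\omega_0}{(a + j\omega)^2 + \omega_0^2}$, $a > 0$
12. $e^{-at}\cos(\omega_0 t)u(t) \longleftrightarrow \frac{a + j\omega}{(a + j\omega)^2 + \omega_0^2}$, $a > 0$
13. $e^{-\frac{t^2}{2\sigma^2}} \longleftrightarrow \sqrt{2\pi}\,\sigma\, e^{-\frac{\sigma^2\omega^2}{2}}$
14. $\delta(t) \longleftrightarrow 1$
15. $1 \longleftrightarrow 2\pi\delta(\omega)$
16. $\delta(t - t_0) \longleftrightarrow e^{-j\omega t_0}$
17. $e^{j\omega_0 t} \longleftrightarrow 2\pi\delta(\omega - \omega_0)$
18. $f(t)e^{j\omega_0 t} \longleftrightarrow F(\omega - \omega_0)$


Table 6.2: Fourier transform pairs of commonly used functions.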


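Example 6.2. Let $X$ be a random variable with PDF $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. Find
the characteristic function.


Solution. The Fourier transform pair is

$$\lambda e^{-\lambda x} \;\overset{\mathcal{F}}{\longleftrightarrow}\; \lambda \cdot \frac{1}{\lambda + j\omega}.$$

Therefore, the characteristic function is $\Phi_X(j\omega) = \frac{\lambda}{\lambda + j\omega}$.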
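Example 6.3. Let $X$ and $Y$ be independent, and let

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0, \\ 0, & x < 0, \end{cases} \qquad f_Y(y) = \begin{cases} \lambda e^{-\lambda y}, & y \ge 0, \\ 0, & y < 0. \end{cases}$$

Find the PDF of $Z = X + Y$.


Solution. The characteristic function of $X$ and $Y$ can be found from the Fourier table:

$$\Phi_X(j\omega) = \frac{\lambda}{\lambda + j\omega} \quad \text{and} \quad \Phi_Y(j\omega) = \frac{\lambda}{\lambda + j\omega}.$$

Therefore, the characteristic function of $Z$ is

$$\Phi_Z(j\omega) = \Phi_X(j\omega)\Phi_Y(j\omega) = \frac{\lambda^2}{(\lambda + j\omega)^2}.$$

By inverse Fourier transform, we have that

$$f_Z(z) = \mathcal{F}^{-1}\left\{ \frac{\lambda^2}{(\lambda + j\omega)^2} \right\} = \lambda^2 z e^{-\lambda z}, \quad z \ge 0.$$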
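
The sum of two independent Exponential($\lambda$) random variables has PDF $\lambda^2 z e^{-\lambda z}$ for $z \ge 0$, and this can be cross-checked against the direct convolution $f_Z(z) = \int_0^z f_X(x) f_Y(z - x)\, dx$ from Chapter 5. The sketch below evaluates the convolution integral with a midpoint rule; the step count and function names are assumptions for illustration.

```python
import math

def exp_pdf(x, lam):
    """PDF of an Exponential(lam) random variable."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def pdf_of_sum_numeric(z, lam, steps=20_000):
    """Midpoint-rule approximation of the convolution integral over [0, z]."""
    if z <= 0:
        return 0.0
    h = z / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += exp_pdf(x, lam) * exp_pdf(z - x, lam)
    return total * h

def pdf_of_sum_closed_form(z, lam):
    """Result obtained through the characteristic function: lam^2 * z * e^{-lam z}."""
    return lam**2 * z * math.exp(-lam * z) if z >= 0 else 0.0

lam = 1.5
for z in (0.5, 1.0, 3.0):
    print(z, pdf_of_sum_numeric(z, lam), pdf_of_sum_closed_form(z, lam))
```

For this particular pair of PDFs the integrand $\lambda^2 e^{-\lambda z}$ does not depend on $x$, so the numerical and closed-form values agree up to rounding. In general the convolution is harder to evaluate, which is the motivation for working in the transform domain.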


Why $\Phi_X(j\omega)$ but not $M_X(s)$? As we said, the MGF is not always defined. Recall
that the expectation $\mathbb{E}[X]$ exists only when the integrand is absolutely integrable, i.e., $\mathbb{E}[|X|] < \infty$.
For a characteristic function, the expectation is valid because $\mathbb{E}[|e^{-j\omega X}|] = \mathbb{E}[1] = 1$. However,
for an MGF, $\mathbb{E}[|e^{sX}|]$ could be unbounded. To see a counterexample, we consider the
Cauchy distribution.


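Theorem 6.7. Consider the Cauchy distribution with PDF

$$f_X(x) = \frac{1}{\pi(x^2 + 1)}.$$

The MGF of $X$ is undefined but the characteristic function is well defined.


Proof. For any $s > 0$ (the case $s < 0$ is symmetric), the MGF is

$$\begin{aligned}
M_X(s) &= \int_{-\infty}^{\infty} e^{sx} \frac{1}{\pi(x^2 + 1)}\, dx \ge \int_{1}^{\infty} e^{sx} \frac{1}{\pi(x^2 + 1)}\, dx \\
&\ge \int_{1}^{\infty} \frac{(sx)^3}{6\pi(x^2 + 1)}\, dx, \quad \text{because } e^{sx} \ge \frac{(sx)^3}{6} \\
&\ge \int_{1}^{\infty} \frac{(sx)^3}{6\pi(2x^2)}\, dx = \frac{s^3}{12\pi} \int_{1}^{\infty} x\, dx = \infty.
\end{aligned}$$

Therefore, the MGF is undefined. On the other hand, by the Fourier table we know that

$$\Phi_X(j\omega) = \mathcal{F}\left\{ \frac{1}{\pi(x^2 + 1)} \right\} = e^{-|\omega|}.$$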
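Example 6.4. Let $X_0, X_1, \ldots$ be a sequence of independent random variables with
PDF

$$f_{X_k}(x) = \frac{a_k}{\pi(a_k^2 + x^2)}, \qquad a_k = \frac{1}{2^{k+1}} \quad \text{for } k = 0, 1, \ldots.$$

Find the PDF of $Y$, where $Y = \sum_{k=0}^{\infty} X_k$.


Solution. From the Fourier transform table, we know that

$$\frac{a_k}{\pi(a_k^2 + x^2)} = \frac{1}{a_k \pi} \cdot \frac{a_k^2}{a_k^2 + x^2} \;\overset{\mathcal{F}}{\longleftrightarrow}\; \frac{1}{a_k \pi} \cdot \pi a_k e^{-a_k|\omega|} = e^{-a_k|\omega|}.$$

The characteristic function of $Y$ is

$$\Phi_Y(j\omega) = \prod_{k=0}^{\infty} \Phi_{X_k}(j\omega) = \exp\left\{ -|\omega| \sum_{k=0}^{\infty} a_k \right\}.$$

Since $\sum_{k=0}^{\infty} a_k = \sum_{k=0}^{\infty} \frac{1}{2^{k+1}} = \frac{1}{2} + \frac{1}{4} + \cdots = 1$, the characteristic function becomes
$\Phi_Y(j\omega) = e^{-|\omega|}$. The inverse Fourier transform gives us

$$e^{-|\omega|} \;\overset{\mathcal{F}}{\longleftrightarrow}\; \frac{1}{\pi(1 + x^2)}.$$

Therefore the PDF of $Y$ is

$$f_Y(y) = \frac{1}{\pi(1 + y^2)}.$$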


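Example 6.5. Let $X$ and $Y$ be independent random variables with PDFs

$$f_X(x) = \begin{cases} e^{-x}, & x \ge 0, \\ 0, & x < 0, \end{cases} \quad \text{and} \quad f_Y(y) = \begin{cases} e^{-y}, & y \ge 0, \\ 0, & y < 0. \end{cases}$$

Find the PDF of $Z = \max(X, Y) - \min(X, Y)$.

Solution. We first show that

$$Z = \max(X, Y) - \min(X, Y) = |X - Y|.$$

Suppose $X > Y$, then $\max(X, Y) = X$ and $\min(X, Y) = Y$. So $Z = X - Y$. If $X < Y$,
then $\max(X, Y) = Y$ and $\min(X, Y) = X$. So $Z = Y - X$. Combining the two cases
gives us $Z = |X - Y|$. Now, consider the Fourier transform of the PDFs:

$$e^{-x}u(x) \;\overset{\mathcal{F}}{\longleftrightarrow}\; \frac{1}{1 + j\omega}.$$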
1+ jw 
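Let $U = X - Y$, and let $Z = |U|$. The characteristic function is

$$\Phi_U(j\omega) = \mathbb{E}\left[e^{-j\omega(X - Y)}\right] = \mathbb{E}\left[e^{-j\omega X}\right] \mathbb{E}\left[e^{j\omega Y}\right] = \frac{1}{1 + j\omega} \cdot \frac{1}{1 - j\omega} = \frac{1}{1 + \omega^2} \;\overset{\mathcal{F}}{\longleftrightarrow}\; f_U(u) = \frac{1}{2} e^{-|u|}.$$

With the PDF of $U$, we can find the CDF of $Z$: for $z \ge 0$,

$$F_Z(z) = \mathbb{P}[|U| \le z] = \int_{-z}^{z} \frac{1}{2} e^{-|u|}\, du = 2 \int_{0}^{z} \frac{1}{2} e^{-u}\, du = 1 - e^{-z}.$$

Hence, the PDF is

$$f_Z(z) = \frac{d}{dz} F_Z(z) = e^{-z}, \quad z \ge 0.$$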

Closing remark. Moment-generating functions and characteristic functions are useful
mathematical tools. In this section, we have confined our discussion to using them to compute
the PDF of a sum of random variables. Later sections and chapters will explain further uses
for these functions. For example, we use the MGFs when proving Chernoff’s bound and
proving the Central Limit Theorem.


6.2 Probability Inequalities 


Moment-generating functions and characteristic functions are powerful tools for handling the 
sum of random variables. We now introduce another set of tools, known as the probability 
inequalities, that allow us to do approximations. We will highlight a few basic probability 
inequalities in this section. 


6.2.1 Union bound 


The first inequality is the union bound we introduced when we discussed the axioms of
probability. The union bound states the following:


Theorem 6.8 (Union Bound). Let $A_1, \ldots, A_N$ be a collection of sets. Then

$$\mathbb{P}\left[ \bigcup_{n=1}^{N} A_n \right] \le \sum_{n=1}^{N} \mathbb{P}[A_n].$$
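Proof. We can prove this by induction. First, if $N = 2$,

$$\mathbb{P}[A_1 \cup A_2] = \mathbb{P}[A_1] + \mathbb{P}[A_2] - \mathbb{P}[A_1 \cap A_2] \le \mathbb{P}[A_1] + \mathbb{P}[A_2],$$

because $\mathbb{P}[A_1 \cap A_2]$ is a probability and so it must be non-negative. Thus we have proved
the base case. Assume that the statement is true for $N = K$. We need to prove that the
statement is also true for $N = K + 1$. To this end, we note that

$$\begin{aligned}
\mathbb{P}\left[ \bigcup_{n=1}^{K+1} A_n \right] &= \mathbb{P}\left[ \left( \bigcup_{n=1}^{K} A_n \right) \cup A_{K+1} \right] \\
&= \mathbb{P}\left[ \bigcup_{n=1}^{K} A_n \right] + \mathbb{P}[A_{K+1}] - \mathbb{P}\left[ \left( \bigcup_{n=1}^{K} A_n \right) \cap A_{K+1} \right] \\
&\le \mathbb{P}\left[ \bigcup_{n=1}^{K} A_n \right] + \mathbb{P}[A_{K+1}].
\end{aligned}$$

Then, according to our hypothesis for $N = K$, it follows that

$$\mathbb{P}\left[ \bigcup_{n=1}^{K} A_n \right] \le \sum_{n=1}^{K} \mathbb{P}[A_n].$$

Putting these together,

$$\mathbb{P}\left[ \bigcup_{n=1}^{K+1} A_n \right] \le \sum_{n=1}^{K} \mathbb{P}[A_n] + \mathbb{P}[A_{K+1}] = \sum_{n=1}^{K+1} \mathbb{P}[A_n].$$

Therefore, by the principle of induction, we have proved the statement.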


Remark. The tightness of the union bound depends on the amount of overlapping between
the events $A_1, \ldots, A_N$, as illustrated in Figure 6.4. If the events are disjoint, the union bound
is tight. If the events are overlapping significantly, the union bound is loose. The idea of the union
bound is the principle of divide and conquer. We decompose the system into smaller events
and use the union bound to upper-limit the overall probability. If
the probability of each event is small, the union bound tells us that the overall probability
of the system will also be small.


Figure 6.4: Conditions under which the union bound is loose or tight. [Left] The union bound is loose 
when the sets are overlapping. [Right] The union bound is tight when the sets are (nearly) disjoint. 
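
The loose and tight regimes can be seen on a small sample space. The sketch below uses two fair dice (an assumed toy example, not one from the text) and computes both sides of the union bound exactly with fractions, once for overlapping events and once for disjoint events.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def probability(event):
    """Exact probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

def union_and_bound(events):
    """Return (P[union of the events], sum of the individual probabilities)."""
    p_union = probability(lambda w: any(e(w) for e in events))
    bound = sum(probability(e) for e in events)
    return p_union, bound

overlapping = [lambda w: w[0] == 6, lambda w: w[1] == 6, lambda w: w[0] + w[1] >= 11]
disjoint = [lambda w: w[0] + w[1] == 2, lambda w: w[0] + w[1] == 3, lambda w: w[0] + w[1] == 4]

p_overlap, bound_overlap = union_and_bound(overlapping)
p_disjoint, bound_disjoint = union_and_bound(disjoint)
print(p_overlap, bound_overlap)     # prints: 11/36 5/12
print(p_disjoint, bound_disjoint)   # prints: 1/6 1/6
```

For the overlapping events the bound $5/12$ overshoots the true probability $11/36$, while for the disjoint events the bound holds with equality.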
Example 6.6. Let $X_1, \ldots, X_N$ be a sequence of i.i.d. random variables with CDF
$F_X(x)$ and let $Z = \min(X_1, \ldots, X_N)$. Find an upper bound on the CDF of $Z$.


Solution. Note that $Z = \min(X_1, \ldots, X_N) \le z$ is equivalent to at least one of the
$X_n$'s being less than or equal to $z$. Thus, we have that

$$\{Z = \min(X_1, \ldots, X_N) \le z\} \iff \{X_1 \le z\} \cup \cdots \cup \{X_N \le z\}.$$

Substituting this result into the CDF,

$$F_Z(z) = \mathbb{P}[Z \le z] = \mathbb{P}\left[ \bigcup_{n=1}^{N} \{X_n \le z\} \right] \le \sum_{n=1}^{N} \mathbb{P}[X_n \le z] = N \cdot F_X(z).$$


6.2.2 The Cauchy-Schwarz inequality 


The second inequality we study here is the Cauchy-Schwarz inequality, which we previously 
mentioned in Chapter 5. We review it for the sake of completeness. 


Theorem 6.9 (Cauchy-Schwarz inequality). Let $X$ and $Y$ be two random variables.
Then

$$\mathbb{E}[XY]^2 \le \mathbb{E}[X^2]\, \mathbb{E}[Y^2]. \tag{6.13}$$


Proof. Let $f(s) = \mathbb{E}[(sX + Y)^2]$ for any real $s$. Then

$$\begin{aligned}
f(s) &= \mathbb{E}[(sX + Y)^2] \\
&= \mathbb{E}[s^2 X^2 + 2sXY + Y^2] \\
&= \mathbb{E}[X^2] s^2 + 2\mathbb{E}[XY] s + \mathbb{E}[Y^2].
\end{aligned}$$

This is a quadratic function of $s$, and $f(s) \ge 0$ for all $s$ because $\mathbb{E}[(sX + Y)^2] \ge 0$.
Recall that a quadratic function $\varphi(x) = ax^2 + bx + c$ with $a > 0$ satisfies $\varphi(x) \ge 0$ for all $x$ if and
only if $b^2 - 4ac \le 0$. Substituting this result into our problem, we show that

$$(2\mathbb{E}[XY])^2 - 4\mathbb{E}[X^2]\, \mathbb{E}[Y^2] \le 0.$$

This implies that

$$\mathbb{E}[XY]^2 \le \mathbb{E}[X^2]\, \mathbb{E}[Y^2],$$

which completes the proof.


Remark. As shown in Chapter 5, the Cauchy-Schwarz inequality is useful in analyzing
$\mathbb{E}[XY]$. For example, we can use the Cauchy-Schwarz inequality to prove that the correlation
coefficient $\rho$ is bounded between $-1$ and $1$.
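
The inequality $\mathbb{E}[XY]^2 \le \mathbb{E}[X^2]\mathbb{E}[Y^2]$ is easy to verify on small joint PMFs. The sketch below computes both sides exactly with fractions for two assumed joint PMFs: a generic one, and one where $Y = 2X$, in which case the inequality becomes an equality. The PMFs and helper names are illustrative only.

```python
from fractions import Fraction

def expectation(func, joint_pmf):
    """E[func(X, Y)] for a joint PMF given as {(x, y): probability}."""
    return sum(prob * func(x, y) for (x, y), prob in joint_pmf.items())

def cauchy_schwarz_sides(joint_pmf):
    """Return (E[XY]^2, E[X^2] * E[Y^2]) as exact fractions."""
    e_xy = expectation(lambda x, y: x * y, joint_pmf)
    e_x2 = expectation(lambda x, y: x * x, joint_pmf)
    e_y2 = expectation(lambda x, y: y * y, joint_pmf)
    return e_xy**2, e_x2 * e_y2

generic = {(1, 2): Fraction(1, 2), (2, 1): Fraction(1, 4), (-1, 3): Fraction(1, 4)}
proportional = {(1, 2): Fraction(1, 2), (2, 4): Fraction(1, 2)}   # here Y = 2X

print(cauchy_schwarz_sides(generic))        # left side is strictly smaller
print(cauchy_schwarz_sides(proportional))   # both sides coincide
```

Equality when $Y$ is a multiple of $X$ corresponds to the quadratic $f(s)$ in the proof touching zero, i.e., $sX + Y = 0$ for some $s$.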
6.2.3 Jensen’s inequality


Our next inequality is Jensen’s inequality. To motivate the inequality, we recall that

$$\text{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2.$$

Since $\text{Var}[X] \ge 0$ for any $X$, it follows that

$$\underbrace{\mathbb{E}[X^2]}_{=\mathbb{E}[g(X)]} \ge \underbrace{\mathbb{E}[X]^2}_{=g(\mathbb{E}[X])}. \tag{6.14}$$

Jensen’s inequality is a generalization of the above result by recognizing that the inequality
does not only hold for the function $g(X) = X^2$ but also for any convex function $g$. The
theorem is stated as follows:


Theorem 6.10 (Jensen’s inequality). Let $X$ be a random variable, and let $g : \mathbb{R} \to \mathbb{R}$
be a convex function. Then

$$\mathbb{E}[g(X)] \ge g(\mathbb{E}[X]). \tag{6.15}$$

If the function $g$ is concave, then the inequality sign is flipped: $\mathbb{E}[g(X)] \le g(\mathbb{E}[X])$. The
way to remember this result is to remember that $\mathbb{E}[X^2] - \mathbb{E}[X]^2 = \text{Var}[X] \ge 0$.

Now, what is a convex function? Informally, a function $g$ is convex if, when we pick any
two points on the function and connect them with a straight line, the line will be above the
function for that segment. This definition is illustrated in Figure 6.5. Consider an interval
$[x, y]$, and the line segment connecting $g(x)$ and $g(y)$. If the function $g(\cdot)$ is convex, then
the entire line segment should be above the curve.


[Figure 6.5 panels: convex, concave, neither. The convex panel marks the points $g(x)$ and $g(y)$.]

Figure 6.5: Illustration of a convex function, a concave function, and a function that is neither convex
nor concave.


The definition of a convex function essentially follows the above picture: 


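Definition 6.4. A function $g$ is convex if

$$g(\lambda x + (1 - \lambda) y) \le \lambda g(x) + (1 - \lambda) g(y), \tag{6.16}$$

for any $0 \le \lambda \le 1$.


Here $\lambda$ represents a “sweeping” constant that goes from $x$ to $y$. When $\lambda = 1$ then $\lambda x + (1 - \lambda) y$
simplifies to $x$, and when $\lambda = 0$ then $\lambda x + (1 - \lambda) y$ simplifies to $y$.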
The definition is easy to understand. The left-hand side $g(\lambda x + (1 - \lambda) y)$ is the function
evaluated at any point in the interval $[x, y]$. The right-hand side is the red straight line we
plotted in Figure 6.5. It connects the two points $g(x)$ and $g(y)$. Convexity means that the
red line is entirely above the curve.

For twice-differentiable 1D functions, convexity can be described by the curvature of
the function. A function is convex if

$$g''(x) \ge 0. \tag{6.17}$$

This is self-explanatory because if the curvature is non-negative for all $x$, then the slope of
$g$ can never decrease.


Example 6.7. The following functions are convex or concave:


• $g(x) = \log x$ is concave, because $g'(x) = \frac{1}{x}$ and $g''(x) = -\frac{1}{x^2} < 0$ for all $x > 0$.


• $g(x) = x^2$ is convex, because $g'(x) = 2x$ and $g''(x) = 2$ is positive.


• $g(x) = e^{-x}$ is convex, because $g'(x) = -e^{-x}$ and $g''(x) = e^{-x} > 0$.


Why is Jensen’s inequality valid for a convex function? Consider the illustration in
Figure 6.6. Suppose we have a random variable $X$ with some PDF $f_X(x)$. There is a
convex function $g(\cdot)$ that maps the random variable $X$ to $g(X)$. Since $g(\cdot)$ is convex, a PDF
like the one we see in Figure 6.6 will become skewed. (You can map the left tail to the new
left tail, the peak to the new peak, and the right tail to the new right tail.) As you can see
from the figure, the new random variable $g(X)$ has a mean $\mathbb{E}[g(X)]$ that is greater than the
mapped old mean $g(\mathbb{E}[X])$. Jensen’s inequality captures this phenomenon by stating that
$\mathbb{E}[g(X)] \ge g(\mathbb{E}[X])$ for any convex function $g(\cdot)$.


[Plot: the mean $\mathbb{E}[X]$ is marked on the horizontal axis.]

Figure 6.6: Jensen’s inequality states that if there is a convex function $g(\cdot)$ that maps a random variable
$X$ to a new random variable $g(X)$, the new mean $\mathbb{E}[g(X)]$ will be greater than the mapped old mean
$g(\mathbb{E}[X])$.


Proving Jensen’s inequality is straightforward for a two-state discrete random variable.
Define a random variable $X$ with states $x$ and $y$. The probabilities for these two states are
$\mathbb{P}[X = x] = \lambda$ and $\mathbb{P}[X = y] = 1 - \lambda$. Then

$$\mathbb{E}[X] = \sum_{x' \in \{x, y\}} x' p_X(x') = \lambda x + (1-\lambda) y.$$
Now, let $g(\cdot)$ be a convex function. We know from the expectation that

$$\mathbb{E}[g(X)] = \sum_{x' \in \{x, y\}} g(x') p_X(x') = \lambda g(x) + (1-\lambda) g(y).$$

By convexity of the function $g(\cdot)$, it follows that

$$\underbrace{g(\lambda x + (1-\lambda) y)}_{= g(\mathbb{E}[X])} \le \underbrace{\lambda g(x) + (1-\lambda) g(y)}_{= \mathbb{E}[g(X)]},$$

where in the underbraces we substitute the definitions using the expectation. Therefore,
for any two-state discrete random variable, the proof of Jensen’s inequality follows directly
from the convexity. If the discrete random variable takes more than two states, we can prove
the theorem by induction. For continuous random variables, we can prove the theorem using
the following approach.


Here we present an alternative proof of Jensen’s inequality that does not require proof
by induction. The idea is to recognize that if the function $g$ is convex we can find a tangent
line $L(X) = aX + b$ at the point $\mathbb{E}[X]$ that is uniformly lower than $g(X)$, i.e., $g(X) \ge L(X)$
for all $X$. Then we can prove the result with a simple geometric argument. Figure 6.7
illustrates this idea.


Figure 6.7: Geometric illustration of the proof of Jensen’s inequality. Suppose $g(\cdot)$ is a convex function.
For any point $X$ on $g(\cdot)$, we can find a tangent line $L(X) = aX + b$. Since the black curve is always
above the tangent, it follows that $\mathbb{E}[g(X)] \ge \mathbb{E}[L(X)]$ for any $X$. Also, note that at a particular point
$\mathbb{E}[X]$, the black curve and the red line touch, and so we have $L(\mathbb{E}[X]) = g(\mathbb{E}[X])$.


Proof of Jensen’s inequality. Consider $L(X)$ as defined above. Since $g$ is convex, $g(X) \ge L(X)$
for all $X$. Therefore,

$$\mathbb{E}[g(X)] \ge \mathbb{E}[L(X)] = L(\mathbb{E}[X]) = g(\mathbb{E}[X]),$$

where the middle equality holds because $L$ is linear, so that $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$, and
the last equality holds because $L$ is a tangent line to $g$ where they meet at $\mathbb{E}[X]$.

What are $(a, b)$ in the proof? By Taylor expansion,

$$g(X) \approx g(\mathbb{E}[X]) + g'(\mathbb{E}[X])(X - \mathbb{E}[X]) \overset{\text{def}}{=} L(X).$$

Therefore, if we want to be precise, then $a = g'(\mathbb{E}[X])$ and $b = g(\mathbb{E}[X]) - g'(\mathbb{E}[X])\mathbb{E}[X]$.


The end of the proof.


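Example 6.8. By Jensen’s inequality, we have that


(a) $\mathbb{E}[X^2] \ge \mathbb{E}[X]^2$, because $g(x) = x^2$ is convex.


(b) $\mathbb{E}\left[\frac{1}{X}\right] \ge \frac{1}{\mathbb{E}[X]}$ for a positive random variable $X$, because $g(x) = \frac{1}{x}$ is convex for $x > 0$.


(c) $\mathbb{E}[\log X] \le \log \mathbb{E}[X]$, because $g(x) = \log x$ is concave.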
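
The three cases of Example 6.8 can be checked numerically. The sketch below is only an illustration: the choice $X \sim \text{Uniform}(1, 3)$, the sample size, the seed, and the helper names `mean` and `jensen_sides` are assumptions made here (the interval keeps $X$ strictly positive so that $1/x$ and $\log x$ are well defined). Sample averages stand in for expectations.

```python
import math
import random

random.seed(0)  # arbitrary seed so the run is repeatable
samples = [random.uniform(1.0, 3.0) for _ in range(100_000)]  # assumed X ~ Uniform(1, 3)

def mean(values):
    """Sample average, used here as a stand-in for the expectation."""
    return sum(values) / len(values)

def jensen_sides(g, xs):
    """Return the pair (E[g(X)], g(E[X])) estimated from the samples xs."""
    return mean([g(x) for x in xs]), g(mean(xs))

square = jensen_sides(lambda x: x * x, samples)    # g(x) = x^2 is convex
recip = jensen_sides(lambda x: 1.0 / x, samples)   # g(x) = 1/x is convex on x > 0
log_ = jensen_sides(math.log, samples)             # g(x) = log x is concave

for name, (lhs, rhs) in [("x^2", square), ("1/x", recip), ("log x", log_)]:
    print(f"g(x) = {name:6s} E[g(X)] = {lhs:.4f}   g(E[X]) = {rhs:.4f}")
```

For the two convex functions the first number should be the larger one, and for the logarithm the order should flip.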


6.2.4 Markov’s inequality 
Our next inequality, Markov’s inequality, is an elementary inequality that links probability
and expectation.


Theorem 6.11 (Markov’s inequality). Let $X \ge 0$ be a non-negative random variable.
Then, for any $\epsilon > 0$, we have

$$\mathbb{P}[X \ge \epsilon] \le \frac{\mathbb{E}[X]}{\epsilon}. \tag{6.18}$$


Markov’s inequality concerns the tail of the random variable. As illustrated in Figure 6.8,
$\mathbb{P}[X \ge \epsilon]$ measures the probability that the random variable takes a value greater than $\epsilon$.
Markov’s inequality asserts that this probability $\mathbb{P}[X \ge \epsilon]$ is upper-bounded by the ratio
$\mathbb{E}[X]/\epsilon$. This result is useful because it relates the probability and the expectation. In many
problems the probability $\mathbb{P}[X \ge \epsilon]$ could be difficult to evaluate if the PDF is complicated.
The expectation, on the other hand, is usually easier to evaluate.

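Proof. Consider $\epsilon\,\mathbb{P}[X \ge \epsilon]$. It follows that

$$\epsilon\,\mathbb{P}[X \ge \epsilon] = \int_{\epsilon}^{\infty} \epsilon f_X(x)\,dx \le \int_{\epsilon}^{\infty} x f_X(x)\,dx,$$

where the inequality is valid because $x \ge \epsilon$ over the range of integration and $f_X(x)$ is
non-negative, so replacing $\epsilon$ by $x$ cannot decrease the integrand. It then follows that

$$\int_{\epsilon}^{\infty} x f_X(x)\,dx \le \int_{0}^{\infty} x f_X(x)\,dx = \mathbb{E}[X].$$

Dividing both sides by $\epsilon$ completes the proof.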


A pictorial interpretation of Markov’s inequality is shown in Figure 6.9. For $X \ge 0$, it
is not difficult to show that $\mathbb{E}[X] = \int_0^\infty (1 - F_X(x))\,dx$. Then, in the CDF plot, we see that
$\epsilon \cdot \mathbb{P}[X \ge \epsilon]$ is a rectangle covering the top left corner. This area is clearly smaller than the
area covered by the function $1 - F_X(x)$.
Figure 6.8: Markov’s inequality provides an upper bound to the tail of a random variable. The inequality
states that the probability $\mathbb{P}[X \ge \epsilon]$ is upper bounded by the ratio $\mathbb{E}[X]/\epsilon$.


Figure 6.9: The proof of Markov’s inequality follows from the fact that $\epsilon \cdot \mathbb{P}[X \ge \epsilon]$ occupies the
top left corner marked by the yellow rectangle. The expectation is the area above the CDF so that
$\mathbb{E}[X] = \int_0^\infty (1 - F_X(x))\,dx$. Since the yellow rectangle is smaller than the orange shaded area, it follows
that $\epsilon \cdot \mathbb{P}[X \ge \epsilon] \le \mathbb{E}[X]$, which is Markov’s inequality.


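Practice Exercise 6.4. Prove that if $X \ge 0$, then $\mathbb{E}[X] = \int_0^\infty (1 - F_X(x))\,dx$.


Solution. We start from the right-hand side:

$$\int_0^\infty (1 - F_X(x))\,dx = \int_0^\infty (1 - \mathbb{P}[X \le x])\,dx$$
$$= \int_0^\infty \mathbb{P}[X > x]\,dx$$
$$= \int_0^\infty \int_x^\infty f_X(t)\,dt\,dx$$
$$= \int_0^\infty \int_0^t f_X(t)\,dx\,dt$$
$$= \int_0^\infty t f_X(t)\,dt = \mathbb{E}[X].$$

The change in the integration order is illustrated below.

[Illustration: the same region integrated in the order $dx\,dt$ and in the order $dt\,dx$.]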


How tight is Markov’s inequality? It is possible to create a random variable such that 
the equality is met (see Exercise 6.14). However, in general, the estimate provided by the 
upper bound is not tight. Here is an example. 


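Practice Exercise 6.5. Let $X \sim \text{Uniform}(0, 4)$. Verify Markov’s inequality for $\mathbb{P}[X \ge 2]$,
$\mathbb{P}[X \ge 3]$ and $\mathbb{P}[X \ge 4]$.


Solution. First, we observe that $\mathbb{E}[X] = 2$. Then

$$\mathbb{P}[X \ge 2] = 0.5 \le \frac{\mathbb{E}[X]}{2} = 1,$$
$$\mathbb{P}[X \ge 3] = 0.25 \le \frac{\mathbb{E}[X]}{3} = 0.67,$$
$$\mathbb{P}[X \ge 4] = 0 \le \frac{\mathbb{E}[X]}{4} = 0.5.$$

Therefore, although the upper bounds are all valid, they are very loose.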
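
The comparison in Practice Exercise 6.5 takes only a few lines to reproduce. In the sketch below, the helper names `markov_bound` and `uniform_tail` are hypothetical; the exact tail uses the fact that $X \sim \text{Uniform}(0, 4)$ has $\mathbb{P}[X \ge \epsilon] = (4 - \epsilon)/4$ for $0 \le \epsilon \le 4$.

```python
def markov_bound(mean, eps):
    """Markov's upper bound E[X]/eps on P[X >= eps] for a non-negative X."""
    return mean / eps

def uniform_tail(eps, a=0.0, b=4.0):
    """Exact P[X >= eps] for X ~ Uniform(a, b)."""
    if eps <= a:
        return 1.0
    if eps >= b:
        return 0.0
    return (b - eps) / (b - a)

mean_x = 2.0  # E[X] for Uniform(0, 4)
rows = [(eps, markov_bound(mean_x, eps), uniform_tail(eps)) for eps in (2, 3, 4)]
for eps, bound, exact in rows:
    print(f"eps = {eps}:  Markov bound = {bound:.3f}   exact = {exact:.3f}")
```

In every row the bound sits well above the exact probability, which is the looseness the exercise points out.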


If Markov’s inequality is not tight, why is it useful? It turns out that while Markov’s 
inequality is not tight, its variations can be powerful. We will come back to this point when 
we discuss Chernoff’s bound. 


6.2.5 Chebyshev’s inequality 
The next inequality is a simple extension of Markov’s inequality. The result is known as
Chebyshev’s inequality.


Theorem 6.12 (Chebyshev’s inequality). Let $X$ be a random variable with mean $\mu$.
Then for any $\epsilon > 0$ we have

$$\mathbb{P}[|X - \mu| \ge \epsilon] \le \frac{\mathrm{Var}[X]}{\epsilon^2}. \tag{6.19}$$


The tail measured by Chebyshev’s inequality is illustrated in Figure 6.10. Since the
event $|X - \mu| \ge \epsilon$ involves an absolute value, the probability measures the two-sided tail.
Chebyshev’s inequality states that this tail probability is upper-bounded by $\mathrm{Var}[X]/\epsilon^2$.

Figure 6.10: Chebyshev’s inequality states that the two-sided tail probability $\mathbb{P}[|X - \mu| \ge \epsilon]$ is upper-bounded
by $\mathrm{Var}[X]/\epsilon^2$.


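Proof. We apply Markov’s inequality to the non-negative random variable $(X - \mu)^2$ to show that

$$\mathbb{P}[|X - \mu| \ge \epsilon] = \mathbb{P}[(X - \mu)^2 \ge \epsilon^2] \le \frac{\mathbb{E}[(X - \mu)^2]}{\epsilon^2} = \frac{\mathrm{Var}[X]}{\epsilon^2}.$$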


An alternative form of Chebyshev’s inequality is obtained by letting $\epsilon = k\sigma$. In this
case, we have

$$\mathbb{P}[|X - \mu| \ge k\sigma] \le \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.$$

Therefore, if a random variable is $k$ times the standard deviation away from the mean, then
the probability bound drops to $1/k^2$.


Practice Exercise 6.6. Let $X \sim \text{Uniform}(0, 4)$. Find the bound of Chebyshev’s
inequality for the probability $\mathbb{P}[|X - \mu| \ge 1]$.


Solution. Note that $\mathbb{E}[X] = 2$ and $\sigma^2 = 4^2/12 = 4/3$. Therefore, we have

$$\mathbb{P}[|X - \mu| \ge 1] \le \frac{\sigma^2}{1^2} = \frac{4}{3},$$

which is a valid upper bound, but quite conservative.


Practice Exercise 6.7. Let $X \sim \text{Exponential}(1)$. Find the bound of Chebyshev’s
inequality for the probability $\mathbb{P}[X \ge \epsilon]$.


Solution. Note that $\mathbb{E}[X] = 1$ and $\sigma^2 = 1$. Thus, for $\epsilon > 1$, we have

$$\mathbb{P}[X \ge \epsilon] = \mathbb{P}[X - \mu \ge \epsilon - \mu] \le \mathbb{P}[|X - \mu| \ge \epsilon - \mu]$$
$$\le \frac{\sigma^2}{(\epsilon - \mu)^2} = \frac{1}{(\epsilon - 1)^2}.$$

We can compare this with the exact probability, which is

$$\mathbb{P}[X \ge \epsilon] = 1 - F_X(\epsilon) = e^{-\epsilon}.$$

Again, the estimate given by Chebyshev’s inequality is acceptable but too conservative.
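
To see how conservative the bound in Practice Exercise 6.7 is, the sketch below tabulates $1/(\epsilon - 1)^2$ against the exact tail $e^{-\epsilon}$ of an Exponential(1) random variable. The function names and the chosen values of $\epsilon$ are assumptions for illustration; the bound is only meaningful for $\epsilon > 1$.

```python
import math

def chebyshev_bound_exp(eps):
    """Chebyshev bound 1/(eps - 1)^2 on P[X >= eps] for X ~ Exponential(1)."""
    if eps <= 1.0:
        raise ValueError("the bound requires eps > E[X] = 1")
    return 1.0 / (eps - 1.0) ** 2

def exact_tail_exp(eps):
    """Exact P[X >= eps] = exp(-eps) for X ~ Exponential(1)."""
    return math.exp(-eps)

for eps in (2.0, 3.0, 5.0, 10.0):
    print(f"eps = {eps:4.1f}  Chebyshev = {chebyshev_bound_exp(eps):.4f}  "
          f"exact = {exact_tail_exp(eps):.6f}")
```

The bound decays only polynomially in $\epsilon$, while the exact tail decays exponentially, so the gap widens quickly.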


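Corollary 6.2. Let $X_1, \ldots, X_N$ be i.i.d. random variables with mean $\mathbb{E}[X_n] = \mu$ and
variance $\mathrm{Var}[X_n] = \sigma^2$. Let $\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$ be the sample mean. Then

$$\mathbb{P}\left[|\overline{X}_N - \mu| > \epsilon\right] \le \frac{\sigma^2}{N\epsilon^2}. \tag{6.20}$$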


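Proof. We can first show that $\mathbb{E}[\overline{X}_N] = \mu$ and $\mathrm{Var}[\overline{X}_N]$ satisfies

$$\mathrm{Var}[\overline{X}_N] = \frac{1}{N^2}\sum_{n=1}^{N} \mathrm{Var}[X_n] = \frac{\sigma^2}{N}.$$

Applying Chebyshev’s inequality to $\overline{X}_N$ then gives
$\mathbb{P}[|\overline{X}_N - \mu| > \epsilon] \le \mathrm{Var}[\overline{X}_N]/\epsilon^2 = \sigma^2/(N\epsilon^2)$.


The consequence of this corollary is that the upper bound $\sigma^2/(N\epsilon^2)$ will converge to zero
as $N \to \infty$. Therefore, the probability of getting the event $\{|\overline{X}_N - \mu| > \epsilon\}$ is vanishing.
It means that the sample average $\overline{X}_N$ is converging to the true population mean $\mu$, in the
sense that the probability of failing is shrinking.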


6.2.6 Chernoff’s bound 


We now introduce a powerful inequality or a set of general procedures that gives us some 
highly useful inequalities. The idea is named for Herman Chernoff, although it was actually 
due to his colleague Herman Rubin. 


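Theorem 6.13 (Chernoff’s bound). Let $X$ be a random variable. Then, for any
$\epsilon > 0$, we have that

$$\mathbb{P}[X \ge \epsilon] \le e^{-\varphi(\epsilon)}, \tag{6.21}$$

where$^a$

$$\varphi(\epsilon) = \max_{s > 0} \left\{ s\epsilon - \log M_X(s) \right\}, \tag{6.22}$$

and $M_X(s)$ is the moment-generating function.


$^a$ $\varphi(\epsilon)$ is called the Fenchel-Legendre dual function of $\log M_X$. See references [6-14].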
Proof. There are two tricks in the proof of Chernoff’s bound. The first trick is a nonlinear
transformation. Since $e^{sx}$ is an increasing function of $x$ for any $s > 0$, we have that

$$\mathbb{P}[X \ge \epsilon] = \mathbb{P}[e^{sX} \ge e^{s\epsilon}]$$
$$\overset{(a)}{\le} \frac{\mathbb{E}[e^{sX}]}{e^{s\epsilon}}$$
$$\overset{(b)}{=} e^{-s\epsilon} M_X(s)$$
$$= e^{-s\epsilon + \log M_X(s)},$$

where the inequality (a) is due to Markov’s inequality. Step (b) just uses the definition of
MGF that $\mathbb{E}[e^{sX}] = M_X(s)$.
Now for the second trick. Note that the above result holds for all $s > 0$. That means it
must also hold for the $s$ that minimizes $e^{-s\epsilon + \log M_X(s)}$. This implies that

$$\mathbb{P}[X \ge \epsilon] \le \min_{s > 0} \left\{ e^{-s\epsilon + \log M_X(s)} \right\}.$$

Again, since $e^x$ is increasing, the minimizer of the above probability is also the maximizer
of this function:

$$\varphi(\epsilon) = \max_{s > 0} \left\{ s\epsilon - \log M_X(s) \right\}.$$

Thus, we conclude that $\mathbb{P}[X \ge \epsilon] \le e^{-\varphi(\epsilon)}$.


6.2.7 Comparing Chernoff and Chebyshev 


Let’s consider an example of how Chernoff’s bound can be useful. 

Suppose that we have a random variable $X \sim \text{Gaussian}(0, \sigma^2/N)$. The number $N$ can
be regarded as the number of samples. For example, if $Y_1, \ldots, Y_N$ are $N$ i.i.d. Gaussian random
variables with mean 0 and variance $\sigma^2$, then the average $X = \frac{1}{N}\sum_{n=1}^{N} Y_n$ will have mean
0 and variance $\sigma^2/N$. Therefore, as $N$ grows, the variance of $X$ will become smaller and
smaller.

First, since the random variable is Gaussian, we can show the following: 


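Lemma 6.1. Let $X \sim \text{Gaussian}(0, \frac{\sigma^2}{N})$ be a Gaussian random variable. Then, for any
$\epsilon > 0$,

$$\mathbb{P}[X \ge \epsilon] = 1 - \Phi\left(\frac{\sqrt{N}\epsilon}{\sigma}\right), \tag{6.23}$$

where $\Phi$ is the standard Gaussian’s CDF.


Note that this is the exact result: If you tell me $\epsilon$, $N$, and $\sigma$, then the probability $\mathbb{P}[X \ge \epsilon]$
is exactly the one shown on the right-hand side. No approximation, no randomness.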
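Proof. Since $X$ is Gaussian, the probability is

$$\mathbb{P}[X \ge \epsilon] = \int_{\epsilon}^{\infty} \frac{1}{\sqrt{2\pi(\sigma^2/N)}} \exp\left\{-\frac{x^2}{2(\sigma^2/N)}\right\} dx$$
$$= 1 - \Phi\left(\frac{\epsilon}{\sqrt{\sigma^2/N}}\right) = 1 - \Phi\left(\frac{\sqrt{N}\epsilon}{\sigma}\right).$$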


Let us compute the bound given by Chebyshev’s inequality. 


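Lemma 6.2. Let $X \sim \text{Gaussian}(0, \frac{\sigma^2}{N})$ be a Gaussian random variable. Then, for any
$\epsilon > 0$, Chebyshev’s inequality implies that

$$\mathbb{P}[X \ge \epsilon] \le \frac{\sigma^2}{N\epsilon^2}. \tag{6.24}$$


Proof. We apply Chebyshev’s inequality, noting that $\mu = 0$:

$$\mathbb{P}[X \ge \epsilon] = \mathbb{P}[X - \mu \ge \epsilon - \mu] \le \mathbb{P}[|X - \mu| \ge \epsilon - \mu]$$
$$\le \frac{\mathbb{E}[(X - \mu)^2]}{(\epsilon - \mu)^2} = \frac{\sigma^2}{N\epsilon^2}.$$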


We now compute Chernoff’s bound. 


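Theorem 6.14. Let $X \sim \text{Gaussian}(0, \frac{\sigma^2}{N})$ be a Gaussian random variable. Then, for
any $\epsilon > 0$, Chernoff’s bound implies that

$$\mathbb{P}[X \ge \epsilon] \le \exp\left\{-\frac{\epsilon^2 N}{2\sigma^2}\right\}. \tag{6.25}$$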


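Proof. The MGF of a zero-mean Gaussian random variable with variance $\sigma^2/N$ is
$M_X(s) = \exp\left\{\frac{\sigma^2 s^2}{2N}\right\}$. Therefore, the function $\varphi$ can be written as

$$\varphi(\epsilon) = \max_{s > 0} \left\{ s\epsilon - \log M_X(s) \right\} = \max_{s > 0} \left\{ s\epsilon - \frac{\sigma^2 s^2}{2N} \right\}.$$

To maximize the function we take the derivative and set it to zero. This yields

$$\frac{d}{ds}\left\{ s\epsilon - \frac{\sigma^2 s^2}{2N} \right\} = \epsilon - \frac{\sigma^2 s}{N} = 0 \quad\Longrightarrow\quad s^* = \frac{N\epsilon}{\sigma^2}.$$

Note that this $s^*$ is a maximizer because $s\epsilon - \frac{\sigma^2 s^2}{2N}$ is a concave function of $s$.

Substituting $s^*$ into $\varphi(\epsilon)$,

$$\varphi(\epsilon) = s^*\epsilon - \frac{\sigma^2 (s^*)^2}{2N} = \frac{N\epsilon^2}{\sigma^2} - \frac{N\epsilon^2}{2\sigma^2} = \frac{\epsilon^2 N}{2\sigma^2},$$

and hence

$$\mathbb{P}[X \ge \epsilon] \le e^{-\varphi(\epsilon)} = \exp\left\{-\frac{\epsilon^2 N}{2\sigma^2}\right\}.$$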
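
The maximization in this proof can be cross-checked numerically. The sketch below evaluates $s\epsilon - \log M_X(s)$ on a grid of $s$ values and compares the largest value with the closed form $\epsilon^2 N/(2\sigma^2)$. The grid search, the search range (set, for convenience, to a few times the analytic maximizer), and all function names are assumptions of this sketch rather than part of the proof.

```python
import math

def log_mgf_gaussian(s, sigma, N):
    """log M_X(s) = sigma^2 s^2 / (2N) for X ~ Gaussian(0, sigma^2/N)."""
    return sigma**2 * s**2 / (2.0 * N)

def chernoff_exponent_grid(eps, sigma, N, steps=20_000):
    """Approximate phi(eps) = max_{s>0} {s*eps - log M_X(s)} by a grid search."""
    s_max = 4.0 * N * eps / sigma**2   # assumed range: four times the analytic maximizer
    best = -math.inf
    for k in range(1, steps + 1):
        s = s_max * k / steps
        best = max(best, s * eps - log_mgf_gaussian(s, sigma, N))
    return best

eps, sigma, N = 0.1, 1.0, 100
phi_numeric = chernoff_exponent_grid(eps, sigma, N)
phi_closed = eps**2 * N / (2.0 * sigma**2)
print(f"grid search: {phi_numeric:.6f}   closed form: {phi_closed:.6f}")
print(f"Chernoff bound: {math.exp(-phi_closed):.6f}")
```

For other distributions the same grid search applies as long as `log_mgf_gaussian` is swapped for the appropriate log-MGF and the search range is chosen with care.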


Figure 6.11 shows the comparison between the exact probability, the bound provided
by Chebyshev’s inequality, and Chernoff’s bound:


• Exact: $\mathbb{P}[X \ge \epsilon] = 1 - \Phi\left(\frac{\sqrt{N}\epsilon}{\sigma}\right)$.
• Chebyshev: $\mathbb{P}[X \ge \epsilon] \le \frac{\sigma^2}{N\epsilon^2}$.
• Chernoff: $\mathbb{P}[X \ge \epsilon] \le \exp\left\{-\frac{\epsilon^2 N}{2\sigma^2}\right\}$.


In this numerical experiment, we set $\epsilon = 0.1$ and $\sigma = 1$. We vary the number $N$. As we can
see from the figure, the bound provided by Chebyshev is valid but very loose. It does not
even capture the tail as $N$ grows. On the other hand, Chernoff’s bound is reasonably tight.
However, one should note that the tightness of Chernoff is only valid for large $N$. When $N$
is small, it is possible to construct random variables such that Chebyshev is tighter.

The MATLAB code used to generate this plot is shown below.


% MATLAB code to compare the probability bounds
epsilon = 0.1;
sigma = 1;
N = logspace(1,3.9,50);                        % number of samples, from 10 to about 7900
p_exact = 1-normcdf(sqrt(N)*epsilon/sigma);    % exact tail probability
p_cheby = sigma^2./(epsilon^2*N);              % Chebyshev's bound
p_chern = exp(-epsilon^2*N/(2*sigma^2));       % Chernoff's bound

loglog(N, p_exact, '-o', 'Color', [1 0.5 0], 'LineWidth', 2); hold on;
loglog(N, p_cheby, '-', 'Color', [0.2 0.7 0.1], 'LineWidth', 2);
loglog(N, p_chern, '-', 'Color', [0.2 0.0 0.8], 'LineWidth', 2);
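
For readers working in Python, a rough counterpart of the MATLAB script is sketched below. It prints the three quantities for a few values of $N$ instead of plotting them, and it relies only on the standard library: the Gaussian tail $1 - \Phi(z)$ is written as $\frac{1}{2}\,\mathrm{erfc}(z/\sqrt{2})$, which stays accurate far into the tail. The selected values of $N$ are an assumption of this sketch.

```python
import math

epsilon, sigma = 0.1, 1.0

def p_exact(N):
    """Exact tail 1 - Phi(sqrt(N)*epsilon/sigma), computed through erfc."""
    z = math.sqrt(N) * epsilon / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def p_cheby(N):
    """Chebyshev's bound sigma^2 / (N epsilon^2)."""
    return sigma**2 / (epsilon**2 * N)

def p_chern(N):
    """Chernoff's bound exp(-epsilon^2 N / (2 sigma^2))."""
    return math.exp(-epsilon**2 * N / (2.0 * sigma**2))

for N in (10, 100, 1000, 7900):
    print(f"N = {N:5d}  exact = {p_exact(N):.3e}  "
          f"Chebyshev = {p_cheby(N):.3e}  Chernoff = {p_chern(N):.3e}")
```

Note that for small $N$ the Chebyshev value can exceed 1, in which case it carries no information.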


What could go wrong if we insist on using Chebyshev’s inequality? Consider the fol- 
lowing example. 


Example 6.9. Let $X \sim \text{Gaussian}(0, \sigma^2/N)$. Suppose that we want the probability to
be no greater than a confidence level of $\alpha$:

$$\mathbb{P}[X \ge \epsilon] \le \alpha.$$
[Plot: probability versus $N$ on log-log axes, with curves labeled Exact, Chebyshev, and Chernoff.]

Figure 6.11: Comparison between Chernoff’s bound and Chebyshev’s bound. The random variable we
use is $X \sim \text{Gaussian}(0, \sigma^2/N)$. As $N$ grows, we show the probability bounds predicted by the two
methods.


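Let $\alpha = 0.05$, $\epsilon = 0.1$, and $\sigma = 1$. Find the $N$ using (i) Chebyshev’s inequality and (ii)
Chernoff’s inequality.
Solution: (i) Chebyshev’s inequality implies that

$$\mathbb{P}[X \ge \epsilon] \le \frac{\sigma^2}{N\epsilon^2} \le \alpha,$$

which means that

$$N \ge \frac{\sigma^2}{\alpha\epsilon^2}.$$

If we plug in $\alpha = 0.05$, $\epsilon = 0.1$, and $\sigma = 1$, then $N \ge 2000$.


(ii) For Chernoff’s inequality, it holds that

$$\mathbb{P}[X \ge \epsilon] \le \exp\left\{-\frac{\epsilon^2 N}{2\sigma^2}\right\} \le \alpha,$$

which means that

$$N \ge -\frac{2\sigma^2}{\epsilon^2}\log\alpha.$$

Plugging in $\alpha = 0.05$, $\epsilon = 0.1$, and $\sigma = 1$, we have that $N \ge 600$. This is more than 3
times smaller than the one predicted by Chebyshev’s inequality. Which one is correct?
Both are correct but Chebyshev’s inequality is overly conservative. If $N \ge 600$ can
make $\mathbb{P}[X \ge \epsilon] \le \alpha$, then certainly $N \ge 2000$ will work too. However, $N \ge 2000$ is too
loose.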
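
The two sample-size requirements in Example 6.9 can be computed directly. The sketch below also reports, purely as an extra point of comparison that is not part of the example, the $N$ obtained by inverting the exact Gaussian tail of Lemma 6.1; it uses `statistics.NormalDist.inv_cdf` from the Python standard library. Variable names are hypothetical.

```python
import math
from statistics import NormalDist

alpha, eps, sigma = 0.05, 0.1, 1.0

# Chebyshev: sigma^2 / (N eps^2) <= alpha
n_cheby = sigma**2 / (alpha * eps**2)

# Chernoff: exp(-eps^2 N / (2 sigma^2)) <= alpha
n_chern = -2.0 * sigma**2 * math.log(alpha) / eps**2

# Exact tail (added for comparison): 1 - Phi(sqrt(N) eps / sigma) <= alpha
z = NormalDist().inv_cdf(1.0 - alpha)
n_exact = (sigma * z / eps) ** 2

print(f"Chebyshev requires N >= {n_cheby:.1f}")
print(f"Chernoff  requires N >= {n_chern:.1f}")
print(f"Exact tail requires N >= {n_exact:.1f}")
```

The Chernoff requirement lands just under 600, matching the rounded figure quoted in the example.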
6.2.8 Hoeffding’s inequality

Chernoff’s bound can be used to derive many powerful inequalities. Here we present an
inequality for bounded random variables. This result is known as Hoeffding’s inequality.


Theorem 6.15 (Hoeffding’s inequality). Let $X_1, \ldots, X_N$ be i.i.d. random variables
with $0 \le X_n \le 1$, and $\mathbb{E}[X_n] = \mu$. Then

$$\mathbb{P}\left[|\overline{X}_N - \mu| > \epsilon\right] \le 2e^{-2\epsilon^2 N},$$

where $\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$.


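Proof. (Hoeffding’s inequality) First, we show that

$$\mathbb{P}\left[\overline{X}_N - \mu \ge \epsilon\right] = \mathbb{P}\left[\frac{1}{N}\sum_{n=1}^{N} X_n - \mu \ge \epsilon\right] = \mathbb{P}\left[\sum_{n=1}^{N} (X_n - \mu) \ge \epsilon N\right]$$
$$= \mathbb{P}\left[e^{s\sum_{n=1}^{N}(X_n - \mu)} \ge e^{s\epsilon N}\right]$$
$$\le \frac{\mathbb{E}\left[e^{s\sum_{n=1}^{N}(X_n - \mu)}\right]}{e^{s\epsilon N}} = \left(\frac{\mathbb{E}\left[e^{s(X_n - \mu)}\right]}{e^{s\epsilon}}\right)^N,$$

where the inequality is Markov’s inequality and the last equality uses the fact that the $X_n$ are i.i.d.

Let $Z_n = X_n - \mu$. Then $-\mu \le Z_n \le 1 - \mu$. At this point we use Hoeffding’s lemma (see
below) that $\mathbb{E}[e^{sZ_n}] \le e^{s^2/8}$ because $b - a = (1 - \mu) - (-\mu) = 1$. Thus,

$$\mathbb{P}\left[\overline{X}_N - \mu \ge \epsilon\right] \le \left(\frac{\mathbb{E}[e^{sZ_n}]}{e^{s\epsilon}}\right)^N \le \left(\frac{e^{s^2/8}}{e^{s\epsilon}}\right)^N = e^{\frac{s^2 N}{8} - s\epsilon N}, \quad \forall s > 0.$$

This result holds for all $s > 0$, and thus it holds for the $s$ that minimizes the right-hand side.
This implies that

$$\mathbb{P}\left[\overline{X}_N - \mu \ge \epsilon\right] \le \min_{s} \left\{ \exp\left\{\frac{s^2 N}{8} - s\epsilon N\right\} \right\}.$$

Minimizing the exponent gives $\frac{d}{ds}\left\{\frac{s^2 N}{8} - s\epsilon N\right\} = \frac{sN}{4} - \epsilon N = 0$. Thus we have $s = 4\epsilon$.
Hence,

$$\mathbb{P}\left[\overline{X}_N - \mu \ge \epsilon\right] \le \exp\left\{\frac{(4\epsilon)^2 N}{8} - (4\epsilon)\epsilon N\right\} = e^{-2\epsilon^2 N}.$$

By symmetry, $\mathbb{P}[\overline{X}_N - \mu \le -\epsilon] \le e^{-2\epsilon^2 N}$. Then by union bound we show that

$$\mathbb{P}\left[|\overline{X}_N - \mu| \ge \epsilon\right] = \mathbb{P}\left[\overline{X}_N - \mu \ge \epsilon\right] + \mathbb{P}\left[\overline{X}_N - \mu \le -\epsilon\right]$$
$$\le e^{-2\epsilon^2 N} + e^{-2\epsilon^2 N}$$
$$= 2e^{-2\epsilon^2 N}.$$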
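Lemma 6.3 (Hoeffding’s lemma). Let $a \le X \le b$ be a random variable with
$\mathbb{E}[X] = 0$. Then

$$\mathbb{E}\left[e^{sX}\right] \le \exp\left\{\frac{s^2 (b - a)^2}{8}\right\}. \tag{6.27}$$


Proof. Since $a \le X \le b$, we can write $X$ as a linear combination of $a$ and $b$:

$$X = \lambda b + (1 - \lambda) a,$$

where $\lambda = \frac{X - a}{b - a}$. Since $\exp(\cdot)$ is a convex function, it follows that
$e^{s(\lambda b + (1-\lambda) a)} \le \lambda e^{sb} + (1 - \lambda) e^{sa}$.
(Recall that $h$ is convex if $h(\lambda x + (1 - \lambda) y) \le \lambda h(x) + (1 - \lambda) h(y)$.) Therefore, we have

$$e^{sX} \le \lambda e^{sb} + (1 - \lambda) e^{sa} = \frac{X - a}{b - a} e^{sb} + \frac{b - X}{b - a} e^{sa}.$$

Taking expectations on both sides of the inequality,

$$\mathbb{E}\left[e^{sX}\right] \le \frac{-a}{b - a} e^{sb} + \frac{b}{b - a} e^{sa},$$

because $\mathbb{E}[X] = 0$. Now, if we let $\theta = -\frac{a}{b - a}$, then

$$\frac{-a}{b - a} e^{sb} + \frac{b}{b - a} e^{sa} = \theta e^{sb} + (1 - \theta) e^{sa}$$
$$= e^{sa}\left(1 - \theta + \theta e^{s(b - a)}\right) = \left(1 - \theta + \theta e^{s(b - a)}\right) e^{-s\theta(b - a)}$$
$$= \left(1 - \theta + \theta e^{u}\right) e^{-\theta u} = e^{-\theta u + \log(1 - \theta + \theta e^{u})},$$

where we let $u = s(b - a)$ and used $a = -\theta(b - a)$. This can be simplified as $\mathbb{E}[e^{sX}] \le e^{\varphi(u)}$ by defining

$$\varphi(u) = -\theta u + \log(1 - \theta + \theta e^{u}).$$

The final step is to approximate $\varphi(u)$. To this end, we use Taylor approximation:

$$\varphi(u) = \varphi(0) + u\varphi'(0) + \frac{u^2}{2}\varphi''(\xi),$$

for some $\xi$ between $0$ and $u$. Since $\varphi(0) = 0$, $\varphi'(0) = 0$, and $\varphi''(u) \le \frac{1}{4}$ for all $u$, it follows that

$$\varphi(u) = \frac{u^2}{2}\varphi''(\xi) \le \frac{u^2}{8} = \frac{s^2 (b - a)^2}{8}.$$


End of the proof.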
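
Hoeffding’s lemma can be sanity-checked on a simple case. The sketch below takes a zero-mean random variable that places all of its mass on the two endpoints $a$ and $b$ (with probabilities $b/(b-a)$ and $-a/(b-a)$, so that $\mathbb{E}[X] = 0$) and compares its MGF with the bound in (6.27) over a grid of $s$ values. The particular interval, the grid, and the function names are assumptions chosen here for illustration.

```python
import math

def mgf_two_point(s, a, b):
    """E[exp(sX)] for the zero-mean X with P[X=a] = b/(b-a), P[X=b] = -a/(b-a)."""
    theta = -a / (b - a)
    return (1.0 - theta) * math.exp(s * a) + theta * math.exp(s * b)

def hoeffding_lemma_bound(s, a, b):
    """Right-hand side of Hoeffding's lemma: exp(s^2 (b-a)^2 / 8)."""
    return math.exp(s**2 * (b - a) ** 2 / 8.0)

a, b = -0.3, 0.7                                   # assumed interval with a < 0 < b
s_grid = [k / 10.0 for k in range(-100, 101)]      # s from -10 to 10
worst_gap = min(hoeffding_lemma_bound(s, a, b) - mgf_two_point(s, a, b) for s in s_grid)
print(f"smallest value of (bound - MGF) over the grid: {worst_gap:.3e}")
```

A non-negative smallest gap (up to floating-point rounding) is what the lemma predicts; the gap closes at $s = 0$, where both sides equal 1.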




CHAPTER 6. SAMPLE STATISTICS 


What is so special about Hoeffding’s inequality?


• Since Hoeffding’s inequality is derived from Chernoff’s bound, it inherits the
tightness. Hoeffding’s inequality is much stronger than Chebyshev’s inequality
in bounding the tail distributions.


• Hoeffding’s inequality is one of the few inequalities that do not require $\mathbb{E}[X]$ and
$\mathrm{Var}[X]$ on the right-hand side.


• A downside of the inequality is that boundedness is not always easy to satisfy.
For example, if $X_n$ is a Gaussian random variable, Hoeffding does not apply.
There are more advanced inequalities for situations like these.


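Interpreting Hoeffding’s inequality. One way to interpret Hoeffding’s inequality is to
write the equation as

$$\mathbb{P}\left[|\overline{X}_N - \mu| > \epsilon\right] \le \underbrace{2e^{-2\epsilon^2 N}}_{\delta},$$

which is equivalent to

$$\mathbb{P}\left[|\overline{X}_N - \mu| \le \epsilon\right] \ge 1 - \delta.$$

This means that with a probability at least $1 - \delta$, we have

$$\overline{X}_N - \epsilon \le \mu \le \overline{X}_N + \epsilon.$$

If we let $\delta = 2e^{-2\epsilon^2 N}$, this becomes

$$\overline{X}_N - \sqrt{\frac{1}{2N}\log\frac{2}{\delta}} \le \mu \le \overline{X}_N + \sqrt{\frac{1}{2N}\log\frac{2}{\delta}}. \tag{6.28}$$

This inequality is a confidence interval (see Chapter 9). It says that with probability at
least $1 - \delta$, the interval $[\overline{X}_N - \epsilon, \overline{X}_N + \epsilon]$ includes the true population mean $\mu$.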
There are two questions one can ask about the confidence interval:


• Given $N$ and $\delta$, what is the confidence interval? Equation (6.28) tells us that if we
know $N$, to achieve a probability of at least $1 - \delta$ the confidence interval will follow
Equation (6.28). For example, if $N = 10{,}000$ and $\delta = 0.01$, $\sqrt{\frac{1}{2N}\log\frac{2}{\delta}} = 0.016$.
Therefore, with a probability at least 99%, the true population mean $\mu$ will be included
in the interval

$$\overline{X}_N - 0.016 \le \mu \le \overline{X}_N + 0.016.$$


• If we want to achieve a certain confidence interval, what is the $N$ we need? If we are
given $\epsilon$ and $\delta$, the $N$ we need is

$$2e^{-2\epsilon^2 N} \le \delta \quad\Longrightarrow\quad N \ge \frac{\log\frac{2}{\delta}}{2\epsilon^2}.$$

For example, if $\delta = 0.01$ and $\epsilon = 0.01$, the $N$ we need is approximately $N \ge 26{,}500$.
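
Both questions reduce to evaluating one formula, so they are easy to script. The sketch below is a minimal illustration with hypothetical helper names; it assumes, as in Theorem 6.15, that the samples are i.i.d. and bounded in $[0, 1]$.

```python
import math

def hoeffding_half_width(N, delta):
    """Half-width sqrt(log(2/delta) / (2N)) of the interval in Equation (6.28)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * N))

def hoeffding_sample_size(eps, delta):
    """Smallest integer N satisfying 2*exp(-2*eps^2*N) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps**2))

half_width = hoeffding_half_width(10_000, 0.01)
n_needed = hoeffding_sample_size(0.01, 0.01)
print(f"N = 10000, delta = 0.01: half-width = {half_width:.4f}")
print(f"eps = 0.01, delta = 0.01: N needed  = {n_needed}")
```

Because the half-width scales as $1/\sqrt{N}$, halving the width of the interval requires four times as many samples.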


When is Hoeffding’s inequality used? Hoeffding’s inequality is fundamental in modern 
machine learning theory. In this field, one often wants to quantify how well a learning 
algorithm performs with respect to the complexity of the model and the number of training
samples. For example, if we choose a complex model, we should expect to use more training 
samples or overfit otherwise. Hoeffding’s inequality provides an asymptotic description of 
the training error, testing error, and the number of training samples. The inequality is 
often used to compare the theoretical performance limit of one model versus another model. 
Therefore, although we do not need to use Hoeffding’s inequality in this book, we hope you 
appreciate its tightness. 


Closing Remark. We close this section by providing the historic context of Chernoff’s
inequality. Herman Chernoff, the discoverer of Chernoff’s inequality, wrote the following
many years after the publication of the original paper in 1952.

“In working on an artificial example, I discovered that I was using the Central Limit 
Theorem for large deviations where it did not apply. This led me to derive the asymptotic 
upper and lower bounds that were needed for the tail probabilities. [Herman] Rubin claimed 
he could get these bounds with much less work, and I challenged him. He produced a rather 
simple argument, using Markov’s inequality, for the upper bound. Since that seemed to be 
a minor lemma in the ensuing paper I published (Chernoff, 1952), I neglected to give him 
credit. I now consider it a serious error in judgment, especially because his result is stronger 
for the upper bound than the asymptotic result I had derived.” — Herman Chernoff, “A 
career in statistics,” in Lin et al., Past, Present, and Future of Statistical Science (2014), 
p. 35. 


6.3. Law of Large Numbers 


In this section, we present our first main result: the law of large numbers. We will discuss 
two versions of the law: the weak law and the strong law. We will also introduce two forms 
of convergence: convergence in probability and almost sure convergence. 


6.3.1 Sample average 


The law of large numbers is a probabilistic statement about the sample average. Suppose
that we have a collection of i.i.d. random variables $X_1, \ldots, X_N$. The sample average of these
$N$ random variables is defined as follows:


Definition 6.5. The sample average of a sequence of random variables $X_1, \ldots, X_N$
is

$$\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n. \tag{6.29}$$


If the random variables $X_1, \ldots, X_N$ are i.i.d. so that they have the same population
mean $\mathbb{E}[X_n] = \mu$ (for $n = 1, \ldots, N$), then by the linearity of the expectation,

$$\mathbb{E}\left[\overline{X}_N\right] = \frac{1}{N}\sum_{n=1}^{N} \mathbb{E}[X_n] = \mu.$$

Therefore, the mean of $\overline{X}_N$ is the population mean $\mu$.

The sample average, $\overline{X}_N$, plays an important role in statistics. For example, by surveying
10,000 Americans, we can find a sample average of their ages. Since we never have
access to the true population mean, the sample average is an estimate, and since $\overline{X}_N$ is only
an estimate, we need to ask how good the estimate is.

One reason we ask this question is that $\overline{X}_N$ is a finite-sample “approximation” of $\mu$.
More importantly, the root of the problem is that $\overline{X}_N$ itself is a random variable because
$X_1, \ldots, X_N$ are all random variables. Since $\overline{X}_N$ is a random variable, there is a PDF of
$\overline{X}_N$; there is a CDF of $\overline{X}_N$; there is $\mathbb{E}[\overline{X}_N]$; and there is $\mathrm{Var}[\overline{X}_N]$. Since $\overline{X}_N$ is a random
variable, it has uncertainty. To say that we are confident about $\overline{X}_N$, we need to ensure that
the uncertainty is within some tolerable range.


How do we control the uncertainty? We can compute the variance. If $X_1, \ldots, X_N$ are
i.i.d. random variables with the same variance $\mathrm{Var}[X_n] = \sigma^2$ (for $n = 1, \ldots, N$), then

$$\mathrm{Var}\left[\overline{X}_N\right] = \frac{1}{N^2}\sum_{n=1}^{N} \mathrm{Var}[X_n] = \frac{1}{N^2}\sum_{n=1}^{N} \sigma^2 = \frac{\sigma^2}{N}.$$

Therefore, the variance will shrink to 0 as $N$ grows. In other words, the more samples we
use to construct the sample average, the less deviation the random variable $\overline{X}_N$ will have.


Visualizing the sample average 


To help you visualize the randomness of $\overline{X}_N$, we consider an experiment of drawing $N$
Bernoulli random variables $X_1, \ldots, X_N$ with parameter $p = 1/2$. Since $X_n$ is Bernoulli, it
follows that

$$\mathbb{E}[X_n] = p \quad\text{and}\quad \mathrm{Var}[X_n] = p(1 - p).$$

We construct a sample average $\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$. Since $X_n$ is a Bernoulli random variable,
we know everything about $\overline{X}_N$. First, $\overline{X}_N$ is a scaled binomial random variable, since $N\overline{X}_N$ is the
sum of $N$ Bernoulli random variables. Second, the mean and variance of $\overline{X}_N$ are respectively

$$\mu_{\overline{X}_N} \overset{\text{def}}{=} \mathbb{E}[\overline{X}_N] = \frac{1}{N}\sum_{n=1}^{N} \mathbb{E}[X_n] = p,$$

$$\sigma^2_{\overline{X}_N} \overset{\text{def}}{=} \mathrm{Var}[\overline{X}_N] = \frac{1}{N^2}\sum_{n=1}^{N} \mathrm{Var}[X_n] = \frac{p(1 - p)}{N}.$$

In Figure 6.12, we plot the random variables $\overline{X}_N$ (the black crosses) for every $N$. You
can see that at each $N$, e.g., $N = 100$, there are many possible observations for $\overline{X}_N$ because
$\overline{X}_N$ itself is a random variable. As $N$ increases, we see that the deviation of the random
variables becomes smaller. In the same plot, we show the bounds $\mu \pm 3\sigma_{\overline{X}_N}$, which are three
standard deviations from the mean. We can see clearly that the bounds provide a very good
envelope covering the random variables. As $N$ goes to infinity, we can see that the standard
deviation goes to zero, and so $\overline{X}_N$ approaches the true mean.

For your reference, the MATLAB code and the Python code we used to generate the
plot are shown below.


[Plot: sample average (vertical axis, roughly 0.3 to 0.7) versus $N$ (horizontal axis, $10^2$ to $10^5$, log scale).]

Figure 6.12: The weak law of large numbers. In this plot, we assume that $X_1, \ldots, X_N$ are i.i.d. Bernoulli
random variables with a parameter $p$. The black crosses in the plot are the sample averages
$\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$. The red curves are the ideal bounds $\mu_{\overline{X}_N} \pm 3\sigma_{\overline{X}_N}$, where $\mu_{\overline{X}_N} = p$ and
$\sigma_{\overline{X}_N} = \sqrt{p(1 - p)/N}$. As $N$ grows, we observe that the variance shrinks to zero. Therefore, the
sample average is converging to the true population mean.


% MATLAB code to illustrate the weak law of large numbers
Nset = round(logspace(2,5,100));          % sample sizes from 10^2 to 10^5
p = 0.5;                                  % Bernoulli parameter
x = zeros(1000, length(Nset));
for i=1:length(Nset)
    N = Nset(i);
    x(:,i) = binornd(N, p, 1000,1)/N;     % 1000 sample averages for this N
end
y = x(1:10:end,:)';                       % keep every 10th realization for plotting
semilogx(Nset, y, 'kx'); hold on;
semilogx(Nset, p+3*sqrt(p*(1-p)./Nset), 'r', 'LineWidth', 4);
semilogx(Nset, p-3*sqrt(p*(1-p)./Nset), 'r', 'LineWidth', 4);


# Python code to illustrate the weak law of large numbers
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

p = 0.5
Nset = np.round(np.logspace(2, 5, 100)).astype(int)   # sample sizes from 10^2 to 10^5
x = np.zeros((1000, Nset.size))
for i in range(Nset.size):
    N = Nset[i]
    x[:, i] = stats.binom.rvs(N, p, size=1000) / N    # 1000 sample averages for this N
Nset_grid = np.tile(Nset, (1000, 1))                  # same result as the older np.matlib.repmat

plt.semilogx(Nset_grid, x, 'ko')
plt.semilogx(Nset, p + 3*np.sqrt(p*(1-p)/Nset), 'r', linewidth=6)
plt.semilogx(Nset, p - 3*np.sqrt(p*(1-p)/Nset), 'r', linewidth=6)
plt.show()
Note the outliers for each $N$ in Figure 6.12. For example, at $N = 10^2$ we see a point
located near 0.7 on the y-axis. This point is outside three standard deviations. Is it normal?
Yes. Being outside three standard deviations only says that the probability of having this
outlier is small. It does not say that the outlier is impossible. Having a small probability does
not exclude the possibility. By contrast, if you say that something will surely not happen you
mean that there is not even a small probability. The former is a weaker statement than the
latter. Therefore, even though we establish a three standard deviation envelope, there are
points falling outside the envelope. As $N$ grows, the chance of having a bad outlier becomes
smaller.

If the random variables X,, are i.i.d., the above phenomenon is universal. Below is an 
example of the Poisson case. 


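Practice Exercise 6.8. Let $X_n \sim \text{Poisson}(\lambda)$. Define the sample average as
$\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$. Find the mean and variance of $\overline{X}_N$.


Solution. Since $X_n$ is Poisson, we know that $\mathbb{E}[X_n] = \lambda$ and $\mathrm{Var}[X_n] = \lambda$. So

$$\mathbb{E}[\overline{X}_N] = \frac{1}{N}\sum_{n=1}^{N} \mathbb{E}[X_n] = \lambda, \qquad \mathrm{Var}[\overline{X}_N] = \frac{1}{N^2}\sum_{n=1}^{N} \mathrm{Var}[X_n] = \frac{\lambda}{N}.$$

Therefore, as $N \to \infty$, the variance $\mathrm{Var}[\overline{X}_N] \to 0$.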


6.3.2 Weak law of large numbers (WLLN) 


The analysis of Figure 6.12 shows us something important, namely that the convergence 
in a probabilistic way is different from that in a deterministic way. We now describe one 
fundamental result related to probabilistic convergence, known as the weak law of large 
numbers. 


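Theorem 6.16 (Weak law of large numbers). Let $X_1, \ldots, X_N$ be a set of i.i.d. random
variables with mean $\mu$ and variance $\sigma^2$. Assume $\mathbb{E}[X^2] < \infty$. Let $\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$.
Then for any $\epsilon > 0$,

$$\lim_{N \to \infty} \mathbb{P}\left[|\overline{X}_N - \mu| > \epsilon\right] = 0. \tag{6.30}$$


Proof. By Chebyshev’s inequality,

$$\mathbb{P}\left[|\overline{X}_N - \mu| > \epsilon\right] \le \frac{\mathrm{Var}[\overline{X}_N]}{\epsilon^2} = \frac{\sigma^2}{N\epsilon^2}.$$

Therefore, setting $N \to \infty$ we have

$$\lim_{N \to \infty} \mathbb{P}\left[|\overline{X}_N - \mu| > \epsilon\right] \le \lim_{N \to \infty} \frac{\sigma^2}{N\epsilon^2} = 0.$$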
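Example 6.10. Consider a set of i.i.d. random variables $X_1, \ldots, X_N$ where

$$X_n \sim \text{Gaussian}(\mu, \sigma^2).$$

Verify that the sample average $\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$ follows the weak law of large numbers.


Solution: Since $X_n$ is a Gaussian, the sample average $\overline{X}_N$ is also a Gaussian:

$$\overline{X}_N \sim \text{Gaussian}\left(\mu, \frac{\sigma^2}{N}\right).$$

Consider the probability $\mathbb{P}[|\overline{X}_N - \mu| > \epsilon]$ for each $N$:

$$\delta_N \overset{\text{def}}{=} \mathbb{P}\left[|\overline{X}_N - \mu| > \epsilon\right]$$
$$= \mathbb{P}\left[\overline{X}_N - \mu > \epsilon\right] + \mathbb{P}\left[\overline{X}_N - \mu < -\epsilon\right]$$
$$= 1 - \Phi\left(\frac{\sqrt{N}\epsilon}{\sigma}\right) + \Phi\left(-\frac{\sqrt{N}\epsilon}{\sigma}\right)$$
$$= 2\Phi\left(-\frac{\sqrt{N}\epsilon}{\sigma}\right).$$

If we set $\sigma = 1$ and $\epsilon = 0.1$, then

$$\delta_1 = 2\Phi(-0.1\sqrt{1}) = 0.9203, \qquad \delta_5 = 2\Phi(-0.1\sqrt{5}) = 0.8231,$$
$$\delta_{10} = 2\Phi(-0.1\sqrt{10}) = 0.7518, \qquad \delta_{100} = 2\Phi(-0.1\sqrt{100}) = 0.3173,$$
$$\delta_{1000} = 2\Phi(-0.1\sqrt{1000}) = 0.0016.$$

As you can see, the sequence $\delta_1, \delta_2, \ldots, \delta_N, \ldots$ rapidly converges to 0 as $N$ grows.
In fact, since $\Phi(z)$ is an increasing function for $z < 0$ with $\Phi(-\infty) = 0$, it follows that

$$\lim_{N \to \infty} \mathbb{P}\left[|\overline{X}_N - \mu| > \epsilon\right] = \lim_{N \to \infty} 2\Phi\left(-\frac{\sqrt{N}\epsilon}{\sigma}\right) = 0.$$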
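
The values of $\delta_N$ in Example 6.10 can be reproduced with the standard library. The sketch below uses the identity $2\Phi(-z) = \mathrm{erfc}(z/\sqrt{2})$, which avoids the loss of precision that a direct evaluation of $\Phi$ suffers far in the tail; the function name `delta` and the chosen list of $N$ values are assumptions of this sketch.

```python
import math

def delta(N, eps=0.1, sigma=1.0):
    """delta_N = 2*Phi(-sqrt(N)*eps/sigma), evaluated as erfc(z/sqrt(2))."""
    z = math.sqrt(N) * eps / sigma
    return math.erfc(z / math.sqrt(2.0))

for N in (1, 5, 10, 100, 1000, 10000):
    print(f"delta_{N:<6d} = {delta(N):.4e}")
```

The printed sequence decreases monotonically toward zero, which is the convergence that the weak law of large numbers asserts for this Gaussian case.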


The weak law of large numbers is portrayed graphically in Figure 6.13. In this figure
we draw several PDFs of the sample average $\overline{X}_N$. The shapes of the PDFs are getting
narrower as the variance of the random variable shrinks. Since the PDFs become narrower,
the probability $\mathbb{P}[|\overline{X}_N - \mu| > \epsilon]$ becomes smaller. At the limit when $N \to \infty$, the
probability vanishes. The weak law of large numbers asserts that this happens for any set of
i.i.d. random variables. It says that the sequence of probability values $\delta_N \overset{\text{def}}{=} \mathbb{P}[|\overline{X}_N - \mu| > \epsilon]$
will converge to zero.


Figure 6.13: The weak law of large numbers states that as $N$ increases, the variance of the sample
average $\overline{X}_N$ shrinks. As a result, the probability $\mathbb{P}[|\overline{X}_N - \mu| > \epsilon]$ decreases and eventually vanishes.
Note that the convergence here is that of the sequence of probabilities $\mathbb{P}[|\overline{X}_N - \mu| > \epsilon]$, which is just
a sequence of numbers.


What is the weak law of large numbers? 


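Let $\overline{X}_N$ be the sample average of i.i.d. random variables $X_1, \ldots, X_N$. Then

$$\lim_{N \to \infty} \mathbb{P}\left[|\overline{X}_N - \mu| > \epsilon\right] = 0. \tag{6.31}$$


For details, see Theorem 6.16.


The WLLN concerns the sequence of probability values $\delta_N = \mathbb{P}[|\overline{X}_N - \mu| > \epsilon]$.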


The probabilities converge to zero as N grows. 


It is weak because having a small probability does not exclude the possibility of 
happening. 


6.3.3 Convergence in probability 


The example above tells us that in order to show convergence, we need to first compute the
probability $\delta_N$ of each event and then take the limit of the sequence, e.g., the one shown in
the table below:

$\delta_1$     $\delta_5$     $\delta_{10}$    $\delta_{100}$    $\delta_{1000}$    $\delta_{10000}$
0.9203    0.8231    0.7518    0.3173    0.0016    $1.5240 \times 10^{-23}$

Therefore, the convergence is the convergence of the probability. Since $\{\delta_1, \delta_2, \ldots\}$ is a
sequence of real numbers (between 0 and 1), any convergence results for real numbers apply
here.

Note that the convergence controls only the probabilities. Probability means chance. 
Therefore, having the limit converging to zero only means that the chance of happening is 
becoming smaller and smaller. However, at any N, there is still a chance that some bad 
event can happen. 
What do we mean by a bad event? Assume that $X_n$ are fair coins. The sample average
$\overline{X}_N = (1/N)\sum_{n=1}^{N} X_n$ is more or less equal to 1/2 as $N$ grows. However, even if $N$ is a
large number, say $N = 1000$, we are still not certain that the sample average is exactly 1/2.
It is possible, though very unlikely, that we obtain 1000 heads or 1000 tails (so that the
sample average is “1” or “0”). The bottom line is: Having a probability converging to zero
only means that for any tolerance level we can always find an $N$ large enough so that the
probability is smaller than that tolerance.

The type of convergence described by the weak law of large numbers is known as the 
convergence in probability. 


Definition 6.6. A sequence of random variables $A_1, \ldots, A_N$ converges in probability
to a deterministic number $\alpha$ if for every $\epsilon > 0$,

$$\lim_{N \to \infty} \mathbb{P}\left[|A_N - \alpha| > \epsilon\right] = 0. \tag{6.32}$$

We write $A_N \overset{p}{\to} \alpha$ to denote convergence in probability.


The following two examples illustrate how to prove convergence in probability. 


Example 6.11. Let $X_1, \ldots, X_N$ be i.i.d. random variables with $X_n \sim \text{Uniform}(0, 1)$.
Define $A_N = \min(X_1, \ldots, X_N)$. Show that $A_N$ converges in probability to zero.


Solution. (Without determining the PDF of $A_N$, we notice that as $N$ increases, the
value of $A_N$ will likely decrease. Therefore, we should expect $A_N$ to converge to zero.)
Pick an $\epsilon > 0$. It follows that


$$\begin{aligned}
P[|A_N - 0| > \epsilon] &= P[\min(X_1, \ldots, X_N) > \epsilon], \quad \text{because } X_n \ge 0 \\
&= P[X_1 > \epsilon \text{ and } \cdots \text{ and } X_N > \epsilon] \\
&= P[X_1 > \epsilon] \cdots P[X_N > \epsilon] = (1 - \epsilon)^N.
\end{aligned}$$


Setting the limit of $N \to \infty$, we conclude that

$$\lim_{N \to \infty} P[|A_N - 0| > \epsilon] = \lim_{N \to \infty} (1 - \epsilon)^N = 0.$$


Therefore, $A_N$ converges to zero in probability.
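The closed form $(1-\epsilon)^N$ can be checked with a small Monte Carlo experiment. The sketch below is only an illustration: the function name, the number of trials, and the seed are arbitrary choices, not part of the example.

```python
import random

def tail_prob_min_uniform(N, eps, trials=20000, seed=0):
    """Estimate P[min(X_1, ..., X_N) > eps] for i.i.d. Uniform(0, 1) samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a_n = min(rng.random() for _ in range(N))  # one realization of A_N
        if a_n > eps:
            hits += 1
    return hits / trials

eps = 0.1
for N in (1, 5, 20, 50):
    # Compare the empirical frequency with the closed form (1 - eps)^N.
    print(N, tail_prob_min_uniform(N, eps), (1 - eps) ** N)
```

Both columns should shrink toward zero as $N$ grows, which is the convergence in probability claimed in the example.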


Practice Exercise 6.9. Let $X \sim \text{Exponential}(1)$. By evaluating the CDF, we know
that $P[X > a] = e^{-a}$. Let $A_N = X/N$. Prove that $A_N$ converges to zero in probability.


Solution. For any $\epsilon > 0$,


$$P[|A_N - 0| > \epsilon] = P[X/N > \epsilon] = P[X > N\epsilon] = e^{-N\epsilon}.$$


Taking $N \to \infty$ on both sides of the equation gives us


$$\lim_{N \to \infty} P[|A_N - 0| > \epsilon] = \lim_{N \to \infty} e^{-N\epsilon} = 0.$$


Thus, $A_N$ converges to zero in probability.


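Example 6.12. Construct an example such that $A_N$ converges in probability to something,
but $E[A_N]$ does not converge to the same thing.


Solution. Consider a sequence of random variables $A_N$ such that

$$P[A_N = a] = \begin{cases} 1 - \frac{1}{N}, & a = 0, \\ \frac{1}{N}, & a = N^2, \\ 0, & \text{otherwise.} \end{cases}$$


Figure 6.14: Probability density function of the random variable $A_N$.


We first show that $A_N$ converges in probability to zero. Let $\epsilon > 0$ be a fixed
constant. Since $\epsilon > 0$,

$$P[|A_N - 0| > \epsilon] = P[A_N = N^2] = \frac{1}{N}$$

for any $N > \sqrt{\epsilon}$. Therefore, we have that


$$\lim_{N \to \infty} P[|A_N - 0| > \epsilon] = \lim_{N \to \infty} \frac{1}{N} = 0.$$


Hence, $A_N$ converges to 0 in probability.
However, $E[A_N]$ does not converge to zero, because

$$E[A_N] = 0 \cdot \left(1 - \frac{1}{N}\right) + N^2 \cdot \frac{1}{N} = N$$

goes to infinity as $N$ grows.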
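The two behaviors can be seen side by side with exact arithmetic. The sketch below assumes the probability mass function places mass $1/N$ at $a = N^2$ and the remaining mass at $a = 0$; the helper names are hypothetical.

```python
from fractions import Fraction

def pmf(N):
    """PMF of A_N as {value: probability}; assumed form: mass 1/N at N^2, rest at 0."""
    return {0: 1 - Fraction(1, N), N * N: Fraction(1, N)}

def tail_prob(N, eps):
    """P[|A_N - 0| > eps], computed exactly from the PMF."""
    return sum(p for a, p in pmf(N).items() if abs(a) > eps)

def mean(N):
    """E[A_N], computed exactly from the PMF."""
    return sum(a * p for a, p in pmf(N).items())

for N in (10, 100, 1000):
    # The tail probability shrinks like 1/N while the mean grows like N.
    print(N, float(tail_prob(N, 0.5)), mean(N))
```

A rare but very large value is enough to keep the expectation from following the limit in probability.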
6.3.4 Can we prove WLLN using Chernoff’s bound?


The following discussion of using Chernoff’s bound to prove WLLN can be skipped if this
is your first time reading the book.


In proving WLLN we use Chebyshev’s inequality. Can we use Chernoff’s inequality (or
Hoeffding’s) to prove the result? Yes, we can use them. However, notice that the task here is
to prove convergence, not to find the best convergence. Finding the best convergence means
finding the fastest decay rate of the probability sequence. Chernoff’s bound (and Hoeffding’s
inequality) offers a better decay rate. However, Chernoff’s bound needs to be customized for
individual random variables. For example, Chernoff’s bound for Gaussian is different from
Chernoff’s bound for exponential. This makes Chebyshev the most convenient bound
because it only requires the variance to be bounded.

What if we insist on using Chernoff’s bound in proving the WLLN? We can do that for
specific random variables. Let’s consider two examples. The first example is the Gaussian
random variable where $X_n \sim \mathcal{N}(0, \sigma^2)$, so that $\mu = 0$. We know that $\overline{X}_N \sim \mathcal{N}(0, \sigma^2/N)$. Chernoff’s bound
shows that

$$P[|\overline{X}_N - \mu| > \epsilon] \le 2\exp\left\{-\frac{\epsilon^2 N}{2\sigma^2}\right\}.$$

Taking the limit on both sides, we have

$$\lim_{N \to \infty} P[|\overline{X}_N - \mu| > \epsilon] \le \lim_{N \to \infty} 2\exp\left\{-\frac{\epsilon^2 N}{2\sigma^2}\right\} = 0.$$

Note that the rate of convergence here is exponential. The rate of convergence offered by
Chebyshev is only linear, i.e., of order $1/N$. Of course, you may argue that since $X_n$ is Gaussian we have
closed-form expressions about the probability, so we do not need Chernoff’s bound. This is
a legitimate point, and so here is an example where we do not have a closed-form expression
for the probability.

Consider a sequence of arbitrary i.i.d. random variables $X_1, \ldots, X_N$ with $0 \le X_n \le 1$.
Then Hoeffding’s inequality tells us that

$$P[|\overline{X}_N - \mu| > \epsilon] \le 2\exp\{-2\epsilon^2 N\}.$$

Taking the limit on both sides, we have

$$\lim_{N \to \infty} P[|\overline{X}_N - \mu| > \epsilon] \le \lim_{N \to \infty} 2\exp\{-2\epsilon^2 N\} = 0.$$

Again, we obtain a WLLN result, this time for i.i.d. random variables $X_1, \ldots, X_N$ with
$0 \le X_n \le 1$.


As you can see from these two examples, WLLN can be proved in multiple ways 
depending on how general the random variables need to be. 
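To get a feel for the difference in decay rates, the two bounds can be evaluated numerically. The sketch below uses the standard forms of the Chebyshev bound $\sigma^2/(N\epsilon^2)$ and the Hoeffding bound $2\exp\{-2\epsilon^2 N\}$; the choice of Bernoulli(1/2) samples (variance 1/4, values in $[0,1]$) and the function names are assumptions made for illustration.

```python
import math

def chebyshev_bound(var, N, eps):
    """Chebyshev bound on P[|Xbar_N - mu| > eps]: var / (N * eps^2), capped at 1."""
    return min(1.0, var / (N * eps ** 2))

def hoeffding_bound(N, eps):
    """Hoeffding bound for 0 <= X_n <= 1: 2 * exp(-2 * eps^2 * N), capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * eps ** 2 * N))

eps, var = 0.1, 0.25  # var = p(1 - p) for Bernoulli(1/2), an illustrative choice
for N in (10, 100, 1000, 10000):
    print(N, chebyshev_bound(var, N, eps), hoeffding_bound(N, eps))
```

Both columns go to zero, which is all the WLLN needs; the Hoeffding column simply gets there much faster.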


End of the discussions. 
6.3.5 Does the weak law of large numbers always hold?


The following discussion of the failure of the weak law of large numbers can be skipped if
this is your first time reading the book.


The weak law of large numbers does not always hold. Recall that when we prove the
weak law of large numbers using Chebyshev’s inequality, we implicitly require that the
variance $\text{Var}[\overline{X}_N]$ is finite. (Look at the condition that $E[X^2] < \infty$.) Thus for distributions
whose variance is unbounded, Chebyshev’s inequality cannot be applied. One example is the
Cauchy distribution. The PDF of a Cauchy distribution is

$$f_X(x) = \frac{\gamma}{\pi(\gamma^2 + x^2)},$$

where $\gamma$ is a parameter. Letting $\gamma = 1$,

$$\begin{aligned}
E[X^2] &= \int_{-\infty}^{\infty} \frac{x^2}{\pi(1 + x^2)}\, dx = \frac{1}{\pi}\int_{-\infty}^{\infty} \left(1 - \frac{1}{1 + x^2}\right) dx \\
&= \frac{1}{\pi}\int_{-\infty}^{\infty} dx - \frac{1}{\pi}\tan^{-1}(x)\Big|_{-\infty}^{\infty} = \infty.
\end{aligned}$$

Since the second moment is unbounded, the variance of $X$ will also be unbounded.

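A perceptive reader may observe that even if $E[X^2]$ is unbounded, it does not mean
that the tail probability is unbounded. This is correct. However, for Cauchy distributions,
we can show that the sample average $\overline{X}_N$ does not converge to any constant when $N \to \infty$
(and so the WLLN fails). To see this, we note that the characteristic function of a Cauchy
random variable with $\gamma = 1$ is

$$E[e^{-j\omega X}] = \int_{-\infty}^{\infty} \frac{1}{\pi(1 + x^2)}\, e^{-j\omega x}\, dx = e^{-|\omega|}.$$

So for the sample average $\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$, the characteristic function is

$$E[e^{-j\omega \overline{X}_N}] = E\left[e^{-j\frac{\omega}{N}\sum_{n=1}^{N} X_n}\right] = \prod_{n=1}^{N} E\left[e^{-j\frac{\omega}{N} X_n}\right] = \left[e^{-|\omega|/N}\right]^N = e^{-|\omega|},$$

which remains a Cauchy distribution with $\gamma = 1$. Therefore, we have that

$$P[|\overline{X}_N| \le \epsilon] = \int_{-\epsilon}^{\epsilon} \frac{1}{\pi(1 + x^2)}\, dx = \frac{1}{\pi}\tan^{-1}(x)\Big|_{-\epsilon}^{\epsilon} = \frac{2}{\pi}\tan^{-1}(\epsilon).$$

Thus no matter how many samples we have, $P[|\overline{X}_N| \le \epsilon]$ will never converge to 1 (so
$P[|\overline{X}_N| > \epsilon]$ will never converge to 0). Therefore, WLLN does not hold.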
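The failure can also be observed empirically. The following sketch draws standard Cauchy samples by inverse-CDF sampling, $\tan(\pi(U - 1/2))$ with $U \sim \text{Uniform}(0,1)$, and estimates $P[|\overline{X}_N| \le \epsilon]$ for several $N$. The function name, trial count, and seed are illustrative assumptions.

```python
import math
import random

def prob_within(N, eps, trials=4000, seed=0):
    """Estimate P[|Xbar_N| <= eps] for i.i.d. standard Cauchy samples (gamma = 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Inverse-CDF sampling of a standard Cauchy: tan(pi * (U - 1/2)).
        avg = sum(math.tan(math.pi * (rng.random() - 0.5)) for _ in range(N)) / N
        if abs(avg) <= eps:
            hits += 1
    return hits / trials

eps = 1.0
closed_form = 2.0 / math.pi * math.atan(eps)  # (2 / pi) * arctan(eps)
for N in (1, 10, 50):
    print(N, prob_within(N, eps), closed_form)
```

Unlike the earlier examples, the estimated probability should stay near the same value for every $N$ instead of approaching 1.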


End of the discussion. 
6.3.6 Strong law of large numbers


Since there is a “weak” law of large numbers, you will not be surprised to learn that there 
is a strong law of large numbers. The strong law is more restrictive than the weak law. Any 
sequence satisfying the strong law will satisfy the weak law, but not vice versa. Since the 
strong law is “stronger”, the proof is more involved. 


Theorem 6.17 (Strong law of large numbers). Let $X_1, \ldots, X_N$ be a sequence of
i.i.d. random variables with common mean $\mu$ and variance $\sigma^2$. Assume $E[X^4] < \infty$.
Let $\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$ be the sample average. Then


$$P\left[\lim_{N \to \infty} \overline{X}_N = \mu\right] = 1. \qquad (6.33)$$


The strong law flips the order of limit and probability. As you can see, the difference 
between the strong law and the weak law is the order of the limit and the probability. In the 
weak law, the limit is outside the probability, whereas, in the strong law, the limit is inside 
the probability. This switch in order makes the interpretation of the result fundamentally 
different. In the final analysis, the weak law concerns the limit of a sequence of probabilities 
(which are just real numbers between 0 and 1). However, the strong law concerns the limit 
of a sequence of random variables. The strong law answers the question, what is the limiting 
object of the sample average as N grows? 


The strong law concerns the limiting object, not a sequence of numbers. What 
is the “limiting object”? If we denote $\overline{X}_N$ as the sample average using $N$ samples, then
we know that $\overline{X}_1$ is a random variable, $\overline{X}_2$ is a random variable, and all $\overline{X}_N$’s are random
variables. So we have a sequence of random variables. As $N$ goes to infinity, we can ask about
the limiting object $\lim_{N \to \infty} \overline{X}_N$. However, even without any deep analysis, you should be
able to see that $\lim_{N \to \infty} \overline{X}_N$ is another random variable. The strong law says that this
limiting object will “successfully” become a deterministic number $\mu$, after a finite number
of “failures”.


The strong law asserts that there are a finite number of failures. Let us explain
“success” and “failure”. $\overline{X}_N$ is a random variable, so it fluctuates. However, as $N$ goes to
infinity, the strong law says that the number of further times where $\overline{X}_N \neq \mu$ will be zero. That
is, there is a finite number of times where $\overline{X}_N \neq \mu$ (i.e., fail), and afterward, you will be
perfectly fine (i.e., success). More precisely, for any tolerance $\epsilon > 0$, with probability 1 the event
$|\overline{X}_N - \mu| > \epsilon$ occurs for only finitely many $N$. Yes, perfectly fine means 100%. The weak law only guarantees
99.99%.

A good example for differentiating the weak law and the strong law is an electronic 
dictionary that improves itself every time you use it. The weak law says that if you use 
the dictionary for a long period, the probability of making an error will become small. You 
will still get an error once in a while, but the probability is very small. This is a 99.99% 
guarantee, and it is the weak law. The strong law says that the number of failures is finite. 
After you have gone through this finite number of failures, you will be completely free of 
error. This is a 100% guarantee by the strong law. When will you hit this magical number? 
The strong law does not say when; it only asserts the existence of this number. However, this 
existence is already good enough in many ways. It gives a certificate of assurance, whereas 
the weak law still has uncertainty. 
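The "finite number of failures" can be illustrated on a single simulated sample path. The sketch below tracks the running average of fair coin flips and records the last index at which it deviates from 1/2 by more than a tolerance. A finite simulation can only suggest, not prove, that no later failure occurs; the function name, the tolerance, and the seed are assumptions made for illustration.

```python
import random

def last_failure_index(num_flips, eps, seed=0):
    """Last n <= num_flips with |Xbar_n - 1/2| > eps along one path of fair coin flips."""
    rng = random.Random(seed)
    heads = 0
    last_fail = 0
    for n in range(1, num_flips + 1):
        if rng.random() < 0.5:
            heads += 1
        if abs(heads / n - 0.5) > eps:  # a "failure" at time n
            last_fail = n
    return last_fail

# On this path, failures stop long before the end of the simulated horizon.
print(last_failure_index(20000, 0.05))
```

Changing the seed changes when the last failure happens, which mirrors the remark that the strong law asserts the existence of this index but does not say when it occurs.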
Strong law $\neq$ deterministic. If the strong law offers a 100% guarantee, does it mean
that it is a deterministic guarantee? No, the strong law is still a probabilistic statement
because we are still using $P[\cdot]$ to measure an event. The event can include measure-zero
subsets, and the measure-zero subsets can be huge. For example, the set of rational numbers 
on the real line is a measure-zero set when measuring the probability using an integration. 
The strong law does not handle those measure-zero subsets. 


6.3.7 Almost sure convergence 


The discussion below can be skipped if this is your first time reading the book. 


The type of convergence used by the strong law of large numbers is the almost sure 
convergence. It is defined formally as follows. 


Definition 6.7. A sequence of random variables $A_1, \ldots, A_N$ converges almost surely
to $a$ if


$$P\left[\lim_{N \to \infty} A_N = a\right] = 1. \qquad (6.34)$$


We write $A_N \xrightarrow{a.s.} a$ to denote almost sure convergence.


To prove almost sure convergence, one needs to show that the sequence $A_N$ will demonstrate
$A_N \neq a$ for a finite number of times. Afterward, $A_N$ needs to demonstrate $A_N = a$.


Example 6.13.* Construct a sequence of events that converges almost surely.

Solution. Let $X_1, \ldots, X_N$ be i.i.d. random variables such that $X_n \sim \text{Uniform}(0, 1)$.
Define $A_N = \min(X_1, \ldots, X_N)$. Since $A_N$ is nonincreasing and is bounded below by
zero, it must have a limit. Let us call this limit

$$A \overset{\text{def}}{=} \lim_{N \to \infty} A_N.$$

Then we can show that

$$\begin{aligned}
P[A > \epsilon] &= P[\min(X_1, X_2, \ldots) > \epsilon] \\
&\overset{(a)}{\le} P[\min(X_1, X_2, \ldots, X_N) > \epsilon] \\
&\overset{(b)}{=} P[X_1 > \epsilon \text{ and } X_2 > \epsilon \text{ and } \cdots \text{ and } X_N > \epsilon] \\
&= (1 - \epsilon)^N,
\end{aligned}$$

where (a) holds because there are more elements in $(X_1, X_2, \ldots)$ than in
$(X_1, X_2, \ldots, X_N)$. Therefore, the minimum value of the former is no greater than the
minimum value of the latter. (b) holds because if $\min(X_1, X_2, \ldots, X_N) > \epsilon$, then $X_n > \epsilon$
for all $n$.
Since $P[A > \epsilon] \le (1 - \epsilon)^N$ for any $N$, the statement still holds as $N \to \infty$. Thus,

$$P[A > \epsilon] \le \lim_{N \to \infty} (1 - \epsilon)^N = 0.$$

This shows $P[A > \epsilon] = 0$ for any positive $\epsilon$. So $P[A > 0] = 0$, and hence $P[A = 0] = 1$.
Since $A$ is the limit of $A_N$, we conclude that

$$P\left[\lim_{N \to \infty} A_N = 0\right] = 1.$$

So $A_N$ converges to 0 almost surely.


*This example is modified from Bertsekas and Tsitsiklis, Introduction to Probability, Chapter 5.5.


Example 6.14.* Construct an example where a sequence of events converges in probability
but does not converge almost surely.


Solution. Consider a discrete time arrival process. The set of times is partitioned into
consecutive intervals of the form

$$\begin{aligned}
I_1 &= \{2, 3\}, \\
I_2 &= \{4, 5, 6, 7\}, \\
I_3 &= \{8, 9, 10, \ldots, 15\}, \\
&\;\;\vdots \\
I_k &= \{2^k, 2^k + 1, \ldots, 2^{k+1} - 1\}.
\end{aligned}$$

Therefore, the length of each interval is $|I_1| = 2$, $|I_2| = 4$, ..., $|I_k| = 2^k$.
During each interval, there is exactly one arrival. Define $Y_n$ as a binary random
variable such that for every $n \in I_k$,

$$Y_n = \begin{cases} 1, & \text{with probability } \frac{1}{|I_k|}, \\ 0, & \text{with probability } 1 - \frac{1}{|I_k|}. \end{cases}$$

For example, if $n \in \{2, 3\}$, then $P[Y_n = 1] = \frac{1}{2}$. If $n \in \{4, 5, 6, 7\}$, then $P[Y_n = 1] = \frac{1}{4}$.
In general, we have that

$$\lim_{n \to \infty} P[Y_n = 1] = \lim_{k \to \infty} \frac{1}{|I_k|} = \lim_{k \to \infty} \frac{1}{2^k} = 0,$$

and hence

$$\lim_{n \to \infty} P[Y_n = 0] = \lim_{k \to \infty} \left(1 - \frac{1}{2^k}\right) = 1.$$

Therefore, $Y_n$ converges to 0 in probability.
However, when we carry out the experiment, there is exactly one arrival per
interval according to the problem conditions. Since we have an infinite number of
intervals $I_1, I_2, \ldots$, we will have an infinite number of arrivals in total. As a result,
$Y_n = 1$ for infinitely many $n$. We do not know which $Y_n$ will equal 1 and which
$Y_n$ will equal 0. However, we know that there are infinitely many $Y_n$ that are equal
to 1. Therefore, no matter how far we go along the sequence $\{Y_n\}$, there is always a later
term equal to 1. (If $Y_n$ stops being 1 after some $n$, then we will not have an infinite
number of arrivals in total.)

A sequence of 0's and 1's converges to 0 only if it is eventually always 0. Since $Y_n = 1$
for infinitely many $n$ in every realization, it follows that

$$P\left[\lim_{n \to \infty} Y_n = 0\right] = 0.$$

(The sequence does not converge to 1 either, because every interval also contains
indices with $Y_n = 0$; the limit simply does not exist.)

Therefore, $Y_n$ does not converge to 0 almost surely.


*This example is modified from Bertsekas and Tsitsiklis, Introduction to Probability, Chapter 5.5.
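One realization of this arrival process can be simulated in a few lines. The sketch assumes the single arrival in each interval $I_k = \{2^k, \ldots, 2^{k+1}-1\}$ is placed uniformly at random, which is consistent with $P[Y_n = 1] = 1/|I_k|$; the function names are hypothetical.

```python
import random

def simulate_arrivals(num_intervals, seed=0):
    """Return {k: arrival index}, one arrival placed uniformly in I_k = {2^k, ..., 2^(k+1) - 1}."""
    rng = random.Random(seed)
    return {k: rng.randrange(2 ** k, 2 ** (k + 1)) for k in range(1, num_intervals + 1)}

def y(n, arrivals):
    """Y_n = 1 if n is the arrival index of its interval, else 0."""
    k = n.bit_length() - 1  # n lies in I_k exactly when 2^k <= n < 2^(k+1)
    return 1 if arrivals.get(k) == n else 0

arrivals = simulate_arrivals(12)
ones = [n for n in range(2, 2 ** 13) if y(n, arrivals) == 1]
# The 1's become sparser and sparser, yet one appears in every interval.
print(len(ones), ones)
```

The gaps between consecutive 1's roughly double each time, so any individual $Y_n$ is very likely 0 for large $n$, but the path never stops producing 1's.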


End of the discussions. 


6.3.8 Proof of the strong law of large numbers 


The strong law of large numbers can be proved in several ways. We present a proof based on
Bertsekas and Tsitsiklis, Introduction to Probability, Problems 5.16 and 5.17, which require
a finite fourth moment $E[X_n^4] < \infty$. An alternative proof that requires only $E[X_n] < \infty$ is
from Billingsley, Probability and Measure, Theorem 22.1.


The proof of the strong law of large numbers is beyond the scope of this book. This 
section is optional. 


Lemma 6.4. Consider non-negative random variables $X_1, \ldots, X_N$. Assume that


$$E\left[\sum_{n=1}^{\infty} X_n\right] < \infty. \qquad (6.35)$$

Then $X_n$ converges to zero with probability 1.


Proof. Let $S = \sum_{n=1}^{\infty} X_n$. Note that $S$ is a random variable, and our assumption is that
$E[S] < \infty$. Thus, we argue that $S < \infty$ with probability 1. If not, then $S$ will have a positive
probability of being $\infty$. But if this happens, we will have $E[S] = \infty$ because (by the law of
total expectation):

$$E[S] = \underbrace{E[S \mid S = \text{infinite}]\, P[S = \text{infinite}]}_{= \infty} + E[S \mid S = \text{finite}]\, P[S = \text{finite}].$$

Now, since $S$ is finite, the sequence $\{X_1, \ldots, X_N, \ldots\}$ must converge to zero. Otherwise,
if $X_n$ is converging to some constant $c > 0$, then summing the tail of the sequence (which
contains infinitely many terms) gives infinity:

$$S = X_1 + \cdots + X_N + \underbrace{X_{N+1} + \cdots}_{= \text{infinite}}.$$

Since the probability of $S$ being finite is 1, it follows that $\{X_1, \ldots, X_N, \ldots\}$ is converging
to zero with probability 1.


Theorem 6.18 (Strong law of large numbers). Let $X_1, \ldots, X_N$ be a sequence of
i.i.d. random variables with common mean $\mu$ and variance $\sigma^2$. Assume $E[X^4] < \infty$.
Let $\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$ be the sample average. Then


$$P\left[\lim_{N \to \infty} \overline{X}_N = \mu\right] = 1. \qquad (6.36)$$


Proof. We first prove the case where $E[X_n] = 0$. To establish that $\overline{X}_N \to 0$ with probability 1,
we apply the lemma to the non-negative random variables $\overline{X}_N^4$. If we can show that

$$E\left[\sum_{N=1}^{\infty} \overline{X}_N^4\right] < \infty,$$

then the lemma gives $\overline{X}_N^4 \to 0$ with probability 1, and hence $\overline{X}_N \to 0$ with probability 1.

Let us expand the term $E[\overline{X}_N^4]$ as follows:

$$E[\overline{X}_N^4] = \frac{1}{N^4} \sum_{n_1=1}^{N} \sum_{n_2=1}^{N} \sum_{n_3=1}^{N} \sum_{n_4=1}^{N} E[X_{n_1} X_{n_2} X_{n_3} X_{n_4}].$$

There are five possibilities for $E[X_{n_1} X_{n_2} X_{n_3} X_{n_4}]$:

• All indices are different. Then
$$E[X_{n_1} X_{n_2} X_{n_3} X_{n_4}] = E[X_{n_1}] E[X_{n_2}] E[X_{n_3}] E[X_{n_4}] = 0 \cdot 0 \cdot 0 \cdot 0 = 0.$$

• One index is different from other three indices. For example, if $n_1$ is different from
$n_2, n_3, n_4$, then
$$E[X_{n_1} X_{n_2} X_{n_3} X_{n_4}] = E[X_{n_1}]\, E[X_{n_2} X_{n_3} X_{n_4}] = 0 \cdot E[X_{n_2} X_{n_3} X_{n_4}] = 0.$$

• Two indices are identical. For example, if $n_1 = n_3$, and $n_2 = n_4$, then
$$E[X_{n_1} X_{n_2} X_{n_3} X_{n_4}] = E[X_{n_1} X_{n_3}]\, E[X_{n_2} X_{n_4}] = E[X_1^2 X_2^2].$$
There are altogether $3N(N-1)$ of these cases: $N(N-1)$ comes from choosing $N$
followed by choosing $N-1$, and 3 accounts for $n_1 = n_2 \neq n_3 = n_4$, $n_1 = n_3 \neq n_2 = n_4$,
and $n_1 = n_4 \neq n_2 = n_3$.

• Two indices are identical, and two indices are different. For example, if $n_1 = n_3$ but
$n_2$ and $n_4$ are different. Then
$$E[X_{n_1} X_{n_2} X_{n_3} X_{n_4}] = E[X_{n_1} X_{n_3}]\, E[X_{n_2}]\, E[X_{n_4}] = E[X_1^2] \cdot 0 \cdot 0 = 0.$$

• All indices are identical. If $n_1 = n_2 = n_3 = n_4$, then
$$E[X_{n_1} X_{n_2} X_{n_3} X_{n_4}] = E[X_1^4].$$
There are altogether $N$ cases of this.

Therefore, it follows that

$$E[\overline{X}_N^4] = \frac{N E[X_1^4] + 3N(N-1) E[X_1^2 X_2^2]}{N^4}.$$

Since $xy \le (x^2 + y^2)/2$, it follows that

$$E[X_1^2 X_2^2] \le E[(X_1^2)^2 + (X_2^2)^2]/2 = E[X_1^4 + X_2^4]/2 = E[X_1^4].$$

Substituting into the previous result,

$$E[\overline{X}_N^4] \le \frac{N E[X_1^4] + 3N(N-1) E[X_1^4]}{N^4} \le \frac{3N^2}{N^4} E[X_1^4] = \frac{3}{N^2} E[X_1^4].$$

Now, let us complete the proof.

$$E\left[\sum_{N=1}^{\infty} \overline{X}_N^4\right] = \sum_{N=1}^{\infty} E[\overline{X}_N^4] \le \sum_{N=1}^{\infty} \frac{3}{N^2} E[X_1^4] < \infty,$$

because $\sum_{N=1}^{\infty} (1/N^2)$ is the Basel problem with a solution that $\sum_{N=1}^{\infty} (1/N^2) = \pi^2/6$.
Consequently, we have shown that $E\left[\sum_{N=1}^{\infty} \overline{X}_N^4\right] < \infty$.
Then, by the lemma, we have $\overline{X}_N^4$ converging to 0 with probability 1, so $\overline{X}_N$ converges to 0
with probability 1, which proves the result.

If $E[X_n] = \mu$, then just replace $X_n$ with $Y_n = X_n - \mu$ in the above arguments. Then we
can show that $\overline{Y}_N$ converges to 0 with probability 1, which is equivalent to $\overline{X}_N$ converging
to $\mu$ with probability 1.

End of the proof of strong law of large numbers. 
6.4 Central Limit Theorem


The law of large numbers tells us the mean of the sample average $\overline{X}_N = (1/N)\sum_{n=1}^{N} X_n$.
However, if you recall our experiment of throwing $N$ dice and inspecting the PDF of the
sum of the numbers, you may remember that the convolution of an infinite number of
uniform distributions gives us a Gaussian distribution. For example, we show a sequence
of experiments in Figure 6.15. In each experiment, we throw $N$ dice and count the sum.
Therefore, if each face of the die is denoted as $X_n$, then the sum is $X_1 + \cdots + X_N$. We plot
the PDF of the sum. As you can see in the figure, $X_1 + \cdots + X_N$ converges to a Gaussian.
This phenomenon is explained by the Central Limit Theorem (CLT).


Figure 6.15: Pictorial illustration of the Central Limit Theorem. Suppose we throw a die and record the
face. [Left] If we only have one die, then the distribution of the face is uniform. [Middle] If we throw
two dice, the distribution is the convolution of two uniform distributions. This will give us a triangle
distribution. [Right] If we throw five dice, the distribution is becoming similar to a Gaussian. The Central
Limit Theorem says that as $N$ goes to infinity, the distribution of the sum will converge to a Gaussian.
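The distributions described in the caption can be reproduced by repeated convolution of the single-die distribution. The following is a minimal sketch with exact fractions; the helper names are illustrative.

```python
from fractions import Fraction

def convolve(p, q):
    """Discrete convolution of two PMFs given as lists of probabilities."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def pmf_of_sum(num_dice):
    """PMF of the sum of num_dice fair dice; entry i is P[sum = num_dice + i]."""
    die = [Fraction(1, 6)] * 6
    pmf = die
    for _ in range(num_dice - 1):
        pmf = convolve(pmf, die)
    return pmf

for num_dice in (1, 2, 5):
    # One die is flat, two dice give a triangle, five dice look bell-shaped.
    print(num_dice, [round(float(p), 4) for p in pmf_of_sum(num_dice)])
```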


What does the Central Limit Theorem say? Let $\overline{X}_N$ be the sample average, and let
$Z_N = \sqrt{N}\left(\frac{\overline{X}_N - \mu}{\sigma}\right)$ be the normalized variable. The Central Limit Theorem is as follows:


Central Limit Theorem:


The CDF of $Z_N$ is converging pointwise to the CDF of Gaussian(0,1).


Note that we are very careful here. We are not saying that the PDF of $Z_N$ is converging to
the PDF of a Gaussian, nor are we saying that the random variable $Z_N$ is converging to a
Gaussian random variable. We are only saying that the values of the CDF are converging
pointwise. The difference is subtle but important.

To understand the difficulty and the core ideas, we first present the concept of conver- 
gence in distribution. 
6.4.1 Convergence in distribution


Definition 6.8. Let $Z_1, \ldots, Z_N$ be random variables with CDFs $F_{Z_1}, \ldots, F_{Z_N}$ respectively.
We say that a sequence of $Z_1, \ldots, Z_N$ converges in distribution to a random
variable $Z$ with CDF $F_Z$ if


$$\lim_{N \to \infty} F_{Z_N}(z) = F_Z(z), \qquad (6.37)$$


for every continuous point $z$ of $F_Z$. We write $Z_N \xrightarrow{d} Z$ to denote convergence in
distribution.


This definition involves many concepts, which we will discuss one by one. However, the 
definition can be summarized in a nutshell as follows. 


Convergence in distribution = values of the CDF converge. 


Example 1. (Bernoulli) Consider flipping a coin $N$ times. Denote each coin flip as a
Bernoulli random variable $X_n \sim \text{Bernoulli}(p)$, where $n = 1, 2, \ldots, N$. Define $Z_N$ as the sum
of $N$ Bernoulli random variables, so that

$$Z_N = \sum_{n=1}^{N} X_n.$$

We know that the resulting random variable $Z_N$ is a binomial random variable with mean
$Np$ and variance $Np(1 - p)$. Let us plot the PDF $f_{Z_N}(z)$ as shown in Figure 6.16.


Figure 6.16: Convergence in distribution. The convergence in distribution concerns the convergence of
the values of the CDF (not the PDF). In this figure, we let $Z_N = X_1 + \cdots + X_N$, where $X_n$ is a
Bernoulli random variable with parameter $p$. Since a sum of Bernoulli random variables is a binomial,
$Z_N$ is a binomial random variable with parameters $(N, p)$. We plot the PDF of $Z_N$, which is a train of
delta functions, and compare it with the Gaussian PDF. Observe that the error, $\max_z |f_{Z_N}(z) - f_Z(z)|$,
does not converge to 0. The PDF of $Z_N$ is a binomial. A binomial is always a binomial. It will not turn
into a Gaussian.


The first thing we notice in the figure is that as $N$ increases, the PDF of the binomial
has an envelope that is “very Gaussian”. So one temptation is to say that the random
variable $Z_N$ is converging to another random variable $Z$. In addition, we would think that
the PDFs converge in the sense that for all $z$,

$$f_{Z_N}(z) = \binom{N}{z} p^z (1 - p)^{N - z} \longrightarrow f_Z(z) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(z - \mu)^2}{2\sigma^2}\right\},$$

where $\mu = Np$ and $\sigma^2 = Np(1 - p)$.

Unfortunately this argument does not work, because $f_Z(z)$ is continuous but $f_{Z_N}(z)$
is discrete. The sample space of $Z_N$ and the sample space of $Z$ are completely different. In
fact, if we write $f_{Z_N}$ as an impulse train, we observe that

$$f_{Z_N}(z) = \sum_{i=0}^{N} \binom{N}{i} p^i (1 - p)^{N - i}\, \delta(z - i).$$

Clearly, no matter how big $N$ is, the difference $|f_{Z_N}(z) - f_Z(z)|$ will never go to zero
for non-integer values of $z$. Mathematically, we can show that

$$\max_z |f_{Z_N}(z) - f_Z(z)| \not\to 0,$$

as $N \to \infty$. $Z_N$ is a binomial random variable regardless of $N$. It will not become a Gaussian.

If $f_{Z_N}(z)$ is not converging to a Gaussian PDF, how do we explain the convergence?
The answer is to look at the CDF. For discrete PDFs such as a binomial random variable,
the CDF is a staircase function. What we can show is that

$$F_{Z_N}(z) = \sum_{i=0}^{\lfloor z \rfloor} \binom{N}{i} p^i (1 - p)^{N - i} \longrightarrow F_Z(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(t - \mu)^2}{2\sigma^2}\right\} dt.$$

The difference between the PDF convergence and the CDF convergence is that the PDF
does not allow a meaningful “distance” between a discrete function and continuous function.
For CDF, the distance is well defined by taking the difference between the staircase function
and the continuous function. For example, we can compute

$$|F_{Z_N}(z) - F_Z(z)|, \quad \text{for all continuous points } z \text{ of } F_Z,$$

and show that

$$\max_z |F_{Z_N}(z) - F_Z(z)| \to 0.$$

We need to pay attention to the set of $z$’s. We do not evaluate all $z$’s but only the $z$’s
that are continuous points of $F_Z$. If $F_Z$ is Gaussian, this does not matter because all $z$’s
are continuous. However, for CDFs containing discontinuous points, our definition of
convergence in distribution will ignore these discontinuous points because they have a measure
zero.
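The shrinking CDF error can be computed directly for the binomial example. The sketch below compares the staircase CDF of Binomial$(N, p)$ with the Gaussian CDF with matching mean $Np$ and variance $Np(1-p)$; the choice $p = 0.5$ and the function names are illustrative assumptions.

```python
import math

def binom_pmf(N, p, k):
    """P[Z_N = k] for Z_N ~ Binomial(N, p)."""
    return math.comb(N, k) * p ** k * (1 - p) ** (N - k)

def gauss_cdf(z, mu, sigma):
    """CDF of Gaussian(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))

def max_cdf_error(N, p=0.5):
    """max over z of |F_{Z_N}(z) - F_Z(z)|, with mu = Np and sigma^2 = Np(1 - p).

    The binomial CDF is a staircase, so the largest gap occurs at a jump:
    we compare the Gaussian CDF with the staircase just before and at each integer k.
    """
    mu, sigma = N * p, math.sqrt(N * p * (1 - p))
    worst, cdf = 0.0, 0.0
    for k in range(N + 1):
        g = gauss_cdf(k, mu, sigma)
        worst = max(worst, abs(cdf - g))   # staircase value just before the jump at k
        cdf += binom_pmf(N, p, k)
        worst = max(worst, abs(cdf - g))   # staircase value at the jump
    return worst

for N in (10, 50, 500):
    print(N, max_cdf_error(N))
```

The printed errors decrease as $N$ grows, in contrast to the PDF comparison, where a train of deltas never approaches a continuous curve.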


Example 2. (Poisson) Consider $X_n \sim \text{Poisson}(\lambda)$, and consider $X_1, \ldots, X_N$. Define
$Z_N = \sum_{n=1}^{N} X_n$. It follows that $E[Z_N] = \sum_{n=1}^{N} E[X_n] = N\lambda$ and $\text{Var}[Z_N] = \sum_{n=1}^{N} \text{Var}[X_n] = N\lambda$.
Moreover, we know that the sum of Poissons remains a Poisson. Therefore, the PDF of $Z_N$
is

$$f_{Z_N}(z) = \sum_{k=0}^{\infty} \frac{(N\lambda)^k}{k!} e^{-N\lambda}\, \delta(z - k) \quad \text{and} \quad f_Z(z) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(z - \mu)^2}{2\sigma^2}\right\},$$

where $\mu = N\lambda$ and $\sigma^2 = N\lambda$. Again, $f_{Z_N}$ does not converge to $f_Z$. However, if we compare
the CDF, we can see from Figure 6.18 that the CDF of the Poisson is becoming better
approximated by the Gaussian.


Figure 6.17: Convergence in distribution. This is the same as Figure 6.16, but this time we plot the
CDF of $Z_N$ (panels for $N = 10$ and $N = 50$). The CDF is a staircase function. We compare it with the Gaussian CDF. Observe that the
error, $\max_z |F_{Z_N}(z) - F_Z(z)|$, converges to zero as $N$ grows. Convergence in distribution says that the
sequence of CDFs $F_{Z_N}(z)$ will converge to the limiting CDF $F_Z(z)$, at all continuous points of $F_Z(z)$.


Interpreting “convergence in distribution”. After seeing two examples, you should
now have some idea of what “convergence in distribution” means. This concept applies to
the CDFs. When we write

$$\lim_{N \to \infty} F_{Z_N}(z) = F_Z(z), \qquad (6.38)$$

we mean that $F_{Z_N}(z)$ is converging to the value $F_Z(z)$, and this relationship holds for all
the continuous $z$’s of $F_Z$. It does not say that the random variable $Z_N$ is becoming another
random variable $Z$.


$Z_N \xrightarrow{d} Z$ is equivalent to $\lim_{N \to \infty} F_{Z_N}(z) = F_Z(z)$.


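Example 3. (Exponential) So far, we have studied the sum of discrete random variables.
Now, let’s take a look at continuous random variables. Consider $X_n \sim \text{Exponential}(\lambda)$, and
let $X_1, \ldots, X_N$ be i.i.d. copies. Define $Z_N = \sum_{n=1}^{N} X_n$. Then $E[Z_N] = \sum_{n=1}^{N} E[X_n] = N/\lambda$
and $\text{Var}[Z_N] = N/\lambda^2$. How about the PDF of $Z_N$? Using the characteristic functions, we know
that

$$f_{X_n}(x) = \lambda e^{-\lambda x} \quad \longleftrightarrow \quad \Phi_{X_n}(j\omega) = \frac{\lambda}{\lambda + j\omega}.$$

Therefore, the product is

$$\Phi_{Z_N}(j\omega) = \prod_{n=1}^{N} \Phi_{X_n}(j\omega) = \frac{\lambda^N}{(\lambda + j\omega)^N} = \frac{\lambda^N}{(N-1)!} \cdot \frac{(N-1)!}{(\lambda + j\omega)^N} \quad \longleftrightarrow \quad \frac{\lambda^N}{(N-1)!}\, z^{N-1} e^{-\lambda z} = f_{Z_N}(z).$$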
Figure 6.18: Convergence in distribution for a sum of Poisson random variables, shown for (a) $N = 4$,
(b) $N = 10$, and (c) $N = 50$. Here we assume that
$X_1, \ldots, X_N$ are i.i.d. Poisson with a parameter $\lambda$. We let $Z_N = \sum_{n=1}^{N} X_n$ be the sum, and compute the
corresponding PDF (top row) and CDFs (bottom row). Just as with the binomial example, the PDFs of
the Poisson do not converge but the CDFs of the Poisson converge to the CDF of a Gaussian.


This resulting PDF $f_{Z_N}(z) = \frac{\lambda^N}{(N-1)!}\, z^{N-1} e^{-\lambda z}$ is known as the Erlang distribution. The
CDF of the Erlang distribution is

$$F_{Z_N}(z) = \int_{-\infty}^{z} f_{Z_N}(t)\, dt = \int_{0}^{z} \frac{\lambda^N}{(N-1)!}\, t^{N-1} e^{-\lambda t}\, dt = \text{Gamma function}(z, N),$$

where the last integral is known as the incomplete gamma function, evaluated at $z$.

Given all these, we can now compare the PDF and the CDF of $Z_N$ versus $Z$. Figure 6.19
shows the PDFs and the CDFs of $Z_N$ for various $N$ values. In this experiment we set $\lambda = 1$.
As we can see from the experiment, the Erlang distribution’s PDF and CDF converge to
a Gaussian. In fact, for continuous random variables such as exponential random variables,
we indeed have the random variable $Z_N$ converging to the random variable $Z$. This is quite
different from discrete random variables, where $Z_N$ does not converge to $Z$ but only $F_{Z_N}$
converges to $F_Z$.


Is $\xrightarrow{d}$ stronger than $\xrightarrow{p}$? Convergence in distribution is actually weaker than
convergence in probability. Consider a continuous random variable $X$ with a symmetric PDF
$f_X(x)$ such that $f_X(x) = f_X(-x)$. It holds that $-X$ has the same PDF as $X$. If
we define the sequence $Z_N = X$ if $N$ is odd and $Z_N = -X$ if $N$ is even, and let $Z = X$,
then $F_{Z_N}(z) = F_Z(z)$ for every $z$ because the PDFs of $X$ and $-X$ are identical. Therefore,
$Z_N \xrightarrow{d} Z$. However, $Z_N \not\xrightarrow{p} Z$ because $Z_N$ oscillates between the random variables $X$
and $-X$. These two random variables are different (although they have the same CDF)
because $P[X = -X] = P[\{\omega : X(\omega) = -X(\omega)\}] = P[\{\omega : X(\omega) = 0\}] = 0$.
Figure 6.19: Convergence in distribution for a sum of exponential random variables, shown for (a) $N = 4$,
(b) $N = 10$, and (c) $N = 50$. Here we assume
that $X_1, \ldots, X_N$ are i.i.d. exponentials with a parameter $\lambda$. We define $Z_N = \sum_{n=1}^{N} X_n$ to be the sum.
It is known that the sum of exponentials is an Erlang. We compute the corresponding PDF (top row)
and CDFs (bottom row). Unlike the previous two examples, in this example we see that both PDFs and
CDFs of the Erlang distribution are converging to a Gaussian.


6.4.2 Central Limit Theorem


Theorem 6.19 (Central Limit Theorem). Let $X_1, \ldots, X_N$ be i.i.d. random variables
of mean $E[X_n] = \mu$ and variance $\text{Var}[X_n] = \sigma^2$. Also, assume that $E[|X_n^3|] < \infty$. Let
$\overline{X}_N = (1/N)\sum_{n=1}^{N} X_n$ be the sample average, and let $Z_N = \sqrt{N}\left(\frac{\overline{X}_N - \mu}{\sigma}\right)$. Then


$$\lim_{N \to \infty} F_{Z_N}(z) = F_Z(z), \qquad (6.39)$$


where $Z \sim \text{Gaussian}(0, 1)$.


In plain words, the Central Limit Theorem says that the sample average (which is 
a random variable) has a CDF converging to the CDF of a Gaussian. Therefore, if we 
want to evaluate probabilities associated with the sample average, we can approximate the 
probability by the probability of a Gaussian. 

As we discussed above, the Central Limit Theorem does not mean that the random
variable $Z_N$ is converging to a Gaussian random variable, nor does it mean that the PDF
of $Z_N$ is converging to the PDF of a Gaussian. It only means that the CDF of $Z_N$ is
converging to the CDF of a Gaussian. Many people think that the Central Limit Theorem
means “sample average converges to Gaussian”. This is incorrect for the above reasons.
However, it is not completely wrong. For continuous random variables where both PDF
and CDF are continuous, we will not run into situations where the PDF is a train of delta
functions. In this case, convergence in CDF can be translated to convergence in PDF.

The power of the Central Limit Theorem is that the result holds for any distribution
of $X_1, \ldots, X_N$. That is, regardless of the distribution of $X_1, \ldots, X_N$, the CDF of $\overline{X}_N$ is
approaching a Gaussian.


Summary of the Central Limit Theorem


$X_1, \ldots, X_N$ are i.i.d. random variables, with mean $\mu$ and variance $\sigma^2$. They are
not necessarily Gaussians.


Define the sample average as $\overline{X}_N = (1/N)\sum_{n=1}^{N} X_n$, and let $Z_N = \sqrt{N}\left(\frac{\overline{X}_N - \mu}{\sigma}\right)$.


The Central Limit Theorem says $Z_N \xrightarrow{d} \text{Gaussian}(0, 1)$. Equivalently, the
theorem says that $\overline{X}_N \xrightarrow{d} \text{Gaussian}(\mu, \sigma^2/N)$.


So if we want to evaluate the probability of $\overline{X}_N \in A$ for some set $A$, we can
approximate the probability by evaluating the Gaussian:


$$P[\overline{X}_N \in A] \approx \int_{A} \frac{1}{\sqrt{2\pi(\sigma^2/N)}} \exp\left\{-\frac{(y - \mu)^2}{2(\sigma^2/N)}\right\} dy.$$


CLT does not say that the PDF of $\overline{X}_N$ is becoming a Gaussian PDF.
CLT only says that the CDF of $\overline{X}_N$ is becoming a Gaussian CDF.


If the set $A$ is an interval, we can use the standard Gaussian CDF to compute the
probability.


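Corollary 6.3. Let $X_1, \ldots, X_N$ be i.i.d. random variables with mean $\mu$ and variance
$\sigma^2$. Define the sample average as $\overline{X}_N = (1/N)\sum_{n=1}^{N} X_n$. Then


$$P[a \le \overline{X}_N \le b] \approx \Phi\left(\sqrt{N}\,\frac{b - \mu}{\sigma}\right) - \Phi\left(\sqrt{N}\,\frac{a - \mu}{\sigma}\right), \qquad (6.40)$$


where $\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx$ is the CDF of the standard Gaussian.


Proof. By the Central Limit Theorem, we know that $\overline{X}_N \xrightarrow{d} \text{Gaussian}\left(\mu, \frac{\sigma^2}{N}\right)$. Therefore,


$$\begin{aligned}
P[a \le \overline{X}_N \le b] &\approx \int_{a}^{b} \frac{1}{\sqrt{2\pi(\sigma^2/N)}} \exp\left\{-\frac{(y - \mu)^2}{2(\sigma^2/N)}\right\} dy \\
&= \int_{\sqrt{N}\frac{a - \mu}{\sigma}}^{\sqrt{N}\frac{b - \mu}{\sigma}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}\, dy = \Phi\left(\sqrt{N}\,\frac{b - \mu}{\sigma}\right) - \Phi\left(\sqrt{N}\,\frac{a - \mu}{\sigma}\right).
\end{aligned}$$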


A graphical illustration of the CLT is shown in Figure 6.20, where we use a binomial 
random variable (which is the sum of i.i.d. Bernoulli) as an example. The CLT does not say 
that the binomial random variable is becoming a Gaussian. It only says that the probability 
covered by the binomial can be approximated by the Gaussian. 
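The quality of this approximation can be checked on the binomial case, where the exact probability is a finite sum. The sketch below assumes Bernoulli($p$) samples and evaluates $P[a \le \overline{X}_N \le b]$ both exactly and with the standard Gaussian CDF $\Phi$; the function names and the numbers $p = 0.5$, $N = 100$, $a = 0.45$, $b = 0.55$ are illustrative choices.

```python
import math

def Phi(z):
    """CDF of the standard Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def clt_interval_prob(a, b, mu, sigma, N):
    """Gaussian approximation: Phi(sqrt(N)(b - mu)/sigma) - Phi(sqrt(N)(a - mu)/sigma)."""
    s = math.sqrt(N) / sigma
    return Phi(s * (b - mu)) - Phi(s * (a - mu))

def exact_bernoulli_interval_prob(a, b, p, N):
    """Exact P[a <= Xbar_N <= b] for X_n ~ Bernoulli(p), since N * Xbar_N ~ Binomial(N, p)."""
    lo = max(math.ceil(a * N - 1e-9), 0)    # small slack guards against float round-off
    hi = min(math.floor(b * N + 1e-9), N)
    return sum(math.comb(N, k) * p ** k * (1 - p) ** (N - k) for k in range(lo, hi + 1))

p, N = 0.5, 100
sigma = math.sqrt(p * (1 - p))
print(clt_interval_prob(0.45, 0.55, p, sigma, N))
print(exact_bernoulli_interval_prob(0.45, 0.55, p, N))
```

The two numbers are close but not identical: the Gaussian integral approximates a sum of deltas, and the gap shrinks as $N$ grows.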
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Binomial PDF Gaussian PDF 


2 4 6 8 10 
Pla < Xy <b] = sum of deltas Pla < Xy <b] = integrate the curve 
Figure 6.20: The Central Limit Theorem says that if we want to evaluate the probability Pla < Xn < 6], 


where Xn = (1/N) pe Xn is the sample average of i.i.d. random variables X1,...,X~, we can 
approximate the probability by integrating the Gaussian PDF. 
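The "sum of deltas" versus "integrate the curve" comparison can be tried numerically. The sketch below is only an illustration under assumed settings: Bernoulli samples with $p = 0.5$, and an interval of one standard deviation on each side of the mean. The helper names are hypothetical.

```python
import math

def binomial_pmf(k, N, p):
    """P[S_N = k] for S_N ~ Binomial(N, p), computed in the log domain."""
    log_pmf = (math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
               + k * math.log(p) + (N - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def exact_prob(a, b, N, p):
    """Sum of the 'deltas': P[a <= S_N / N <= b] for Bernoulli(p) samples."""
    return sum(binomial_pmf(k, N, p) for k in range(N + 1) if a <= k / N <= b)

def gaussian_prob(a, b, N, p):
    """Integral of the Gaussian(p, p(1-p)/N) PDF over [a, b]."""
    std = math.sqrt(p * (1.0 - p) / N)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return Phi((b - p) / std) - Phi((a - p) / std)

p = 0.5
results = {}
for N in (10, 50, 1000):
    half_width = math.sqrt(p * (1.0 - p) / N)   # one standard deviation of the average
    a, b = p - half_width, p + half_width
    results[N] = (exact_prob(a, b, N, p), gaussian_prob(a, b, N, p))
    print(N, results[N])
```

The Gaussian value is the same for every $N$ because the interval is scaled with the standard deviation, while the exact sum moves toward it as $N$ grows. For small $N$ the gap is visible, since the binomial places its mass on a grid.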


The following proof of the Central Limit Theorem can be skipped if this is your first time
reading the book.


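Proof of the Central Limit Theorem. We now give a “proof” of the Central Limit
Theorem. Technically speaking, this proof does not prove the convergence of the CDF as the
theorem claims; it only proves that the moment-generating function converges. The actual
proof of the CDF convergence is based on the Berry-Esseen Theorem, which is beyond the
scope of this book. However, what we prove below is still useful because it gives us some
intuition about why Gaussian is the limiting random variable we should consider in the first
place.

Let $Z_N = \sqrt{N}\left(\frac{\overline{X}_N-\mu}{\sigma}\right)$. It follows that $\mathbb{E}[Z_N] = 0$ and $\text{Var}[Z_N] = 1$. Therefore, if we
can show that $Z_N$ is converging to a standard Gaussian random variable $Z \sim \text{Gaussian}(0,1)$,
then by the linear transformation property of Gaussian, $Y = \frac{\sigma}{\sqrt{N}} Z + \mu$ will be $\text{Gaussian}(\mu, \sigma^2/N)$.
Our proof is based on analyzing the moment-generating function of $Z_N$. In particular,


$$M_{Z_N}(s) \stackrel{\text{def}}{=} \mathbb{E}\left[e^{sZ_N}\right] = \mathbb{E}\left[e^{s\sqrt{N}\left(\frac{\overline{X}_N-\mu}{\sigma}\right)}\right] = \prod_{n=1}^{N} \mathbb{E}\left[e^{\frac{s}{\sigma\sqrt{N}}(X_n-\mu)}\right],$$


where the last step uses $\sqrt{N}(\overline{X}_N-\mu) = \frac{1}{\sqrt{N}}\sum_{n=1}^{N}(X_n-\mu)$ and the independence of the $X_n$'s.
Expanding the exponential term using the Taylor expansion (Chapter 1.2),


$$\prod_{n=1}^{N} \mathbb{E}\left[e^{\frac{s}{\sigma\sqrt{N}}(X_n-\mu)}\right]
= \prod_{n=1}^{N} \mathbb{E}\left[1 + \frac{s}{\sigma\sqrt{N}}(X_n-\mu) + \frac{s^2}{2\sigma^2 N}(X_n-\mu)^2 + O\left(\frac{(X_n-\mu)^3}{\sigma^3 N\sqrt{N}}\right)\right]$$
$$\approx \prod_{n=1}^{N} \left[1 + \frac{s}{\sigma\sqrt{N}}\mathbb{E}[X_n-\mu] + \frac{s^2}{2\sigma^2 N}\mathbb{E}\left[(X_n-\mu)^2\right]\right] = \left(1 + \frac{s^2}{2N}\right)^N.$$


It remains to show that $\left(1 + \frac{s^2}{2N}\right)^N \to e^{s^2/2}$. If we can show that, we have shown that the
MGF of $Z_N$ converges to the MGF of $\text{Gaussian}(0,1)$. To this end, we consider $\log(1+x)$. By the
Taylor approximation, we have that


$$\log(1+x) = \log(1) + \left(\frac{d}{dx}\log x\Big|_{x=1}\right) x + \left(\frac{d^2}{dx^2}\log x\Big|_{x=1}\right)\frac{x^2}{2} + O(x^3) = x - \frac{x^2}{2} + O(x^3).$$


Therefore, we have $\log\left(1 + \frac{s^2}{2N}\right) \approx \frac{s^2}{2N} - \frac{s^4}{8N^2}$. As $N \to \infty$, the limit becomes


$$\lim_{N\to\infty} N\log\left(1 + \frac{s^2}{2N}\right) = \lim_{N\to\infty}\left(\frac{s^2}{2} - \frac{s^4}{8N}\right) = \frac{s^2}{2},$$


and so taking the exponential on both sides yields $\lim_{N\to\infty}\left(1 + \frac{s^2}{2N}\right)^N = e^{\frac{s^2}{2}}$. Therefore,
we conclude that $\lim_{N\to\infty} M_{Z_N}(s) = e^{\frac{s^2}{2}}$, and so $Z_N$ is converging to a Gaussian.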
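The limit in this argument can be checked numerically. The sketch below compares $(1 + s^2/(2N))^N$ and an exact MGF of $Z_N$ against $e^{s^2/2}$. The Bernoulli distribution, the value $p = 0.3$, and the function name are assumptions made purely for illustration; any distribution with a finite MGF could be substituted.

```python
import math

def mgf_normalized_bernoulli(s, N, p):
    """Exact MGF of Z_N = sqrt(N) (Xbar_N - mu) / sigma for i.i.d. Bernoulli(p)."""
    mu, sigma = p, math.sqrt(p * (1.0 - p))
    t = s / (sigma * math.sqrt(N))          # each factor is E[exp(t (X_n - mu))]
    one_factor = (1.0 - p) * math.exp(-t * mu) + p * math.exp(t * (1.0 - mu))
    return one_factor ** N

s, p = 1.0, 0.3
target = math.exp(s ** 2 / 2.0)             # MGF of Gaussian(0, 1) evaluated at s
for N in (10, 100, 10_000):
    second_order = (1.0 + s ** 2 / (2.0 * N)) ** N
    exact = mgf_normalized_bernoulli(s, N, p)
    print(f"N={N:6d}  (1+s^2/2N)^N={second_order:.5f}  exact={exact:.5f}  e^(s^2/2)={target:.5f}")
```

Both columns approach the Gaussian MGF as $N$ grows. The exact MGF converges more slowly than the second-order expression because it still carries the third-order term that the Taylor truncation drops.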


Limitation of our proof. The limitation of our proof lies in the issue of whether the
integration and the limit are interchangeable:


$$\lim_{N\to\infty} \mathbb{E}\left[e^{sZ_N}\right] = \lim_{N\to\infty}\left\{\int f_{Z_N}(z)\, e^{sz}\,dz\right\} \stackrel{?}{=} \int \left(\lim_{N\to\infty} f_{Z_N}(z)\right) e^{sz}\,dz.$$


If they were, then proving $\lim_{N\to\infty} M_{Z_N}(s) = M_Z(s)$ is sufficient to claim $f_{Z_N}(z) \to f_Z(z)$.
However, we know that the latter is not true in general. For example, if $f_{Z_N}(z)$ is a train of
delta functions, then the limit and the integration are not interchangeable.


Berry-Esseen Theorem. The formal way of proving the Central Limit Theorem is to
prove the Berry-Esseen Theorem. The theorem states that


$$\sup_{z\in\mathbb{R}} \left|F_{Z_N}(z) - F_Z(z)\right| \le C\,\frac{\beta}{\sigma^3\sqrt{N}},$$


where $C$ is a universal constant and $\beta = \mathbb{E}[|X_n-\mu|^3]$ is the third absolute central moment
of $X_n$. Here, you can more or less treat the supremum
operator as the maximum. The left-hand side represents the worst-case error of the CDF
$F_{Z_N}$ compared to the limiting CDF $F_Z$. The right-hand side involves several constants $C$,
$\beta$, and $\sigma$, but they are fixed.

As $N$ goes to infinity, the right-hand side will converge to zero. Therefore, if we can
prove this result, then we have proved the actual Central Limit Theorem. In addition,
we have found the rate of convergence since the right-hand side tells us that the error
drops at the rate of $1/\sqrt{N}$, which is not particularly fast but is sufficient for our purpose.
Unfortunately, proving the Berry-Esseen theorem is not easy. One of the difficulties, for
example, is that one needs to deal with the infinite convolutions in the time domain or the
frequency domain.
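To get a feel for the $1/\sqrt{N}$ rate, the following sketch computes the worst-case CDF error directly for one simple case. It assumes Bernoulli samples with $p = 0.5$ (an illustrative choice), and it makes no claim about the value of the constant $C$; it only reports how the error scales with $N$.

```python
import math

def std_gaussian_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sup_cdf_error(N, p):
    """sup_z |F_{Z_N}(z) - Phi(z)| for Z_N built from N i.i.d. Bernoulli(p) samples.

    F_{Z_N} is a staircase, so the supremum is attained at a jump: we compare
    Phi with the staircase just before and just after every jump location.
    """
    mu, sigma = p, math.sqrt(p * (1.0 - p))
    cdf_before, worst = 0.0, 0.0
    for k in range(N + 1):
        log_pmf = (math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
                   + k * math.log(p) + (N - k) * math.log(1.0 - p))
        cdf_after = cdf_before + math.exp(log_pmf)
        z = (k - N * mu) / (sigma * math.sqrt(N))   # jump location of Z_N
        phi = std_gaussian_cdf(z)
        worst = max(worst, abs(cdf_before - phi), abs(cdf_after - phi))
        cdf_before = cdf_after
    return worst

p = 0.5
for N in (10, 100, 1000):
    err = sup_cdf_error(N, p)
    print(f"N={N:5d}  sup error={err:.4f}  sqrt(N)*error={math.sqrt(N) * err:.4f}")
```

If the error decays like $1/\sqrt{N}$, the last column should stay roughly constant as $N$ increases by factors of ten.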


Interpreting our proof. If our proof is not completely valid, why do we mention it?
For one thing, it provides us with some useful intuition. For most of the (well-behaved)
random variables whose moments are finite, the exponential term in the moment-generating
function can be truncated to a second-order polynomial. Since an MGF whose logarithm is a
second-order polynomial is the MGF of a Gaussian, it follows that as long as we can perform
such a truncation, the truncated random variable will be Gaussian.

To convince you that the Gaussian MGF is the second-order approximation to other
MGFs, we use Bernoulli as an example. Let $X_1,\ldots,X_N$ be i.i.d. Bernoulli with a parameter
$p$. Then the moment-generating function of $\overline{X}_N = (1/N)\sum_{n=1}^{N} X_n$ would be:


$$M_{\overline{X}_N}(s) = \mathbb{E}\left[e^{s\overline{X}_N}\right] = \mathbb{E}\left[e^{\frac{s}{N}\sum_{n=1}^{N} X_n}\right] = \prod_{n=1}^{N}\mathbb{E}\left[e^{\frac{s}{N}X_n}\right]
= \left(1 - p + pe^{\frac{s}{N}}\right)^N$$
$$\approx \left(1 - p + p\left(1 + \frac{s}{N} + \frac{s^2}{2N^2}\right)\right)^N = \left(1 + \frac{sp}{N} + \frac{s^2 p}{2N^2}\right)^N.$$


Using the logarithmic approximation $\log(1+x) \approx x - \frac{x^2}{2}$ and keeping terms up to order $1/N^2$ inside the parentheses, it follows that


$$\log M_{\overline{X}_N}(s) = N\log\left(1 + \frac{sp}{N} + \frac{s^2p}{2N^2}\right) \approx N\left(\frac{sp}{N} + \frac{s^2p}{2N^2} - \frac{s^2p^2}{2N^2}\right) = sp + \frac{s^2p(1-p)}{2N}.$$


Taking the exponential on both sides, we have that


$$M_Y(s) = \exp\left\{sp + \frac{s^2p(1-p)}{2N}\right\},$$


which is the MGF of a Gaussian random variable $Y \sim \text{Gaussian}\left(p, \frac{p(1-p)}{N}\right)$.


Figure 6.21 shows several MGFs. In each of the subfigures we plot the exact MGF
$M_{\overline{X}_N}(s) = (1 - p + pe^{s/N})^N$ as a function of $s$. (The parameter $p$ in this example is $p = 0.5$.) We
vary the number $N$, and we inspect how the shape of $M_{\overline{X}_N}(s)$ changes. On top of the exact
MGFs, we plot the Gaussian approximations $M_Y(s) = \exp\left\{sp + \frac{s^2p(1-p)}{2N}\right\}$. According to
our calculation, this Gaussian approximation is the second-order approximation to the exact
MGF. The figures show the effect of the second-order approximation. For example, in (a)
when $N = 2$ the Gaussian is a quadratic approximation of the exact MGF. For (b) and (c),
as $N$ increases, the approximation improves.

The reason why the second-order approximation works for Gaussian is that when $N$
increases, the higher order moments of $\overline{X}_N$ vanish and only the leading first two moments
survive. The MGFs are becoming flat because $M_Y(s) = \exp\left\{sp + \frac{s^2p(1-p)}{2N}\right\}$ converges to
$\exp\{sp\}$ when $N \to \infty$. Taking the inverse Laplace transform, $M_Y(s) = \exp\{sp\}$ corresponds
to a delta function. This makes sense because as $N$ grows, the variance of $\overline{X}_N$ shrinks.
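The comparison between the exact Bernoulli MGF and its Gaussian approximation is easy to reproduce numerically. The sketch below uses $p = 0.5$ and $N \in \{2, 4, 10\}$ as in the discussion; the evaluation point $s = 1$ and the function names are illustrative assumptions.

```python
import math

def exact_mgf(s, N, p):
    """Exact MGF of the sample average of N i.i.d. Bernoulli(p) variables."""
    return (1.0 - p + p * math.exp(s / N)) ** N

def gaussian_mgf(s, N, p):
    """MGF of Gaussian(p, p(1-p)/N), the second-order approximation."""
    return math.exp(s * p + s ** 2 * p * (1.0 - p) / (2.0 * N))

p, s = 0.5, 1.0
rel_err = {}
for N in (2, 4, 10):
    rel_err[N] = abs(exact_mgf(s, N, p) - gaussian_mgf(s, N, p)) / exact_mgf(s, N, p)
    print(f"N={N:2d}  exact={exact_mgf(s, N, p):.6f}  gaussian={gaussian_mgf(s, N, p):.6f}")
```

The relative gap between the two MGFs shrinks as $N$ increases, which is the numerical counterpart of the higher-order terms vanishing.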


End of the discussion. 


[Figure 6.21 panels: (a) $N = 2$, (b) $N = 4$, (c) $N = 10$. Each panel plots the binomial MGF and the Gaussian MGF on a logarithmic vertical axis.]

Figure 6.21: Explanation of the Central Limit Theorem using the moment-generating function. In this set of plots, we show
the MGF of the random variable $\overline{X}_N = (1/N)\sum_{n=1}^{N} X_n$, where $X_1,\ldots,X_N$ are i.i.d. Bernoulli random
variables. The exact MGF of $\overline{X}_N$ is the binomial, whereas the approximated MGF is the Gaussian. We
observe that as $N$ increases, the Gaussian approximation to the exact MGF improves.


6.4.3. Examples 


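Example 6.15. Prove the equivalence of a few statements.
• $\sqrt{N}\left(\frac{\overline{X}_N-\mu}{\sigma}\right) \stackrel{d}{\longrightarrow} \text{Gaussian}(0,1)$
• $\sqrt{N}(\overline{X}_N-\mu) \stackrel{d}{\longrightarrow} \text{Gaussian}(0,\sigma^2)$
• $\overline{X}_N \stackrel{d}{\longrightarrow} \text{Gaussian}(\mu,\sigma^2/N)$, in the informal sense that $\overline{X}_N$ is approximately $\text{Gaussian}(\mu,\sigma^2/N)$ for large $N$


Solution. The proof is based on the linear transformation property of Gaussian random
variables. For example, if the first statement is true, then the second statement
is also true because


$$\lim_{N\to\infty} F_{\sqrt{N}(\overline{X}_N-\mu)}(z) = \lim_{N\to\infty}\mathbb{P}\left[\sqrt{N}(\overline{X}_N-\mu) \le z\right] = \lim_{N\to\infty}\mathbb{P}\left[\sqrt{N}\left(\frac{\overline{X}_N-\mu}{\sigma}\right) \le \frac{z}{\sigma}\right]$$
$$= \int_{-\infty}^{z/\sigma} \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}}\,dt = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{t^2}{2\sigma^2}}\,dt.$$


The other results can be proved similarly.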


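Example 6.16. Suppose $X_n \sim \text{Poisson}(10)$ for $n = 1,\ldots,N$, and let $\overline{X}_N$ be the
sample average. Use the Central Limit Theorem to approximate $\mathbb{P}[9 \le \overline{X}_N \le 11]$ for
$N = 20$.


Solution. We first show that


$$\mathbb{E}[\overline{X}_N] = \mathbb{E}[X_n] = 10, \qquad \text{Var}[\overline{X}_N] = \frac{\text{Var}[X_n]}{N} = \frac{10}{20} = \frac{1}{2}.$$


Therefore, the Central Limit Theorem implies that $\overline{X}_N \stackrel{d}{\longrightarrow} \text{Gaussian}\left(10, \frac{1}{2}\right)$. The
probability is


$$\mathbb{P}[9 \le \overline{X}_N \le 11] \approx \Phi\left(\frac{11-10}{\sqrt{1/2}}\right) - \Phi\left(\frac{9-10}{\sqrt{1/2}}\right)
= \Phi\left(\frac{1}{\sqrt{0.5}}\right) - \Phi\left(-\frac{1}{\sqrt{0.5}}\right) = 0.9214 - 0.0786 = 0.8427.$$


We can also do an exact calculation to verify our approximation. Let $S_N = \sum_{n=1}^{N} X_n$
so that $\overline{X}_N = S_N/N$. Since a sum of Poisson remains a Poisson, it follows that


$$S_N \sim \text{Poisson}(10N) = \text{Poisson}(200).$$


Consequently,


$$\mathbb{P}[9 < \overline{X}_N \le 11] = \mathbb{P}[180 < S_N \le 220]
= \sum_{\ell=0}^{220} \frac{200^{\ell} e^{-200}}{\ell!} - \sum_{\ell=0}^{180} \frac{200^{\ell} e^{-200}}{\ell!} = 0.9247 - 0.0822 = 0.8425.$$


Note that this is an exact calculation subject to numerical errors when evaluating the
finite sums. Because $S_N$ is discrete, the exact value depends on whether the endpoints are
included: the two sums above exclude the lower endpoint $S_N = 180$.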
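Both calculations in this example can be reproduced with a few lines of Python. The sketch below is a minimal check using only the standard library; the helper names are hypothetical. Because $S_N$ is discrete, the exact answer depends on whether the endpoint $S_N = 180$ is counted, so both variants are reported.

```python
import math

def std_gaussian_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def poisson_cdf(k, lam):
    """P[S <= k] for S ~ Poisson(lam), summing the PMF term by term."""
    term = math.exp(-lam)          # PMF at 0
    total = term
    for ell in range(1, k + 1):
        term *= lam / ell          # PMF recursion: p(ell) = p(ell - 1) * lam / ell
        total += term
    return total

lam, N = 10.0, 20
std_avg = math.sqrt(lam / N)       # standard deviation of the sample average

clt = std_gaussian_cdf((11 - lam) / std_avg) - std_gaussian_cdf((9 - lam) / std_avg)

total_rate = lam * N               # S_N ~ Poisson(200)
lower_excluded = poisson_cdf(220, total_rate) - poisson_cdf(180, total_rate)  # P[180 <  S_N <= 220]
lower_included = poisson_cdf(220, total_rate) - poisson_cdf(179, total_rate)  # P[180 <= S_N <= 220]

print(f"CLT approximation      : {clt:.4f}")
print(f"exact, 180 <  S <= 220 : {lower_excluded:.4f}")
print(f"exact, 180 <= S <= 220 : {lower_included:.4f}")
```

The two exact values differ by the probability mass at $S_N = 180$, a reminder that the Gaussian approximation does not distinguish between open and closed intervals while a discrete distribution does.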


Example 6.17. Suppose you have collected $N = 100$ data points from an unknown
distribution. The only thing you know is that the true population mean is $\mu = 500$
and the standard deviation is $\sigma = 80$. (Note that this distribution is not necessarily a
Gaussian.)


(a) Find the probability that the sample mean will be inside the interval (490, 510).


(b) Find an interval that contains the sample average with probability 95%.


Solution. To solve (a), we note that $\overline{X}_N \stackrel{d}{\longrightarrow} \text{Gaussian}\left(500, \frac{80^2}{100}\right)$. Therefore,


$$\mathbb{P}[490 \le \overline{X}_N \le 510] = \Phi\left(\frac{510-500}{80/\sqrt{100}}\right) - \Phi\left(\frac{490-500}{80/\sqrt{100}}\right)
= \Phi(1.25) - \Phi(-1.25) = 0.7887.$$


To solve (b), we know that $\Phi(x) = 0.025$ implies that $x = -1.96$, and $\Phi(x) = 0.975$
implies that $x = +1.96$. So


$$\frac{y - 500}{80/\sqrt{100}} = \pm 1.96 \quad\Longrightarrow\quad y = 484.32 \;\text{ or }\; y = 515.68.$$


Therefore, $\mathbb{P}[484.32 \le \overline{X}_N \le 515.68] = 0.95$.
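The two parts of this example map directly onto a Gaussian CDF and its inverse. A minimal sketch using `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later) is shown below; the variable names are illustrative.

```python
import math
from statistics import NormalDist

N, mu, sigma = 100, 500.0, 80.0
# CLT: the sample average is approximately Gaussian(mu, sigma^2 / N).
sample_avg = NormalDist(mu=mu, sigma=sigma / math.sqrt(N))

# (a) probability that the sample average falls inside (490, 510)
prob_a = sample_avg.cdf(510.0) - sample_avg.cdf(490.0)

# (b) central interval that captures 95% of the probability
low, high = sample_avg.inv_cdf(0.025), sample_avg.inv_cdf(0.975)

print(f"(a) {prob_a:.4f}")
print(f"(b) [{low:.2f}, {high:.2f}]")
```

Part (a) should come out at roughly 0.789, and part (b) at an interval of about 500 plus or minus 15.68, in line with the hand calculation.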
6.4.4 Limitation of the Central Limit Theorem


If we recall the statement of the Central Limit Theorem (Berry-Esseen), we observe that
the theorem states only that


$$\lim_{N\to\infty}\mathbb{P}\left[\sqrt{N}\left(\frac{\overline{X}_N-\mu}{\sigma}\right) \le \epsilon\right] = \lim_{N\to\infty} F_{Z_N}(\epsilon) = F_Z(\epsilon) = \Phi(\epsilon).$$


Rearranging the terms,


$$\lim_{N\to\infty}\mathbb{P}\left[\overline{X}_N \le \mu + \frac{\sigma\epsilon}{\sqrt{N}}\right] = \Phi(\epsilon).$$


This implies that the approximation is good only when the deviation $\epsilon$ is small.

Let us consider an example to illustrate this idea. Consider a set of i.i.d. exponential
random variables $X_1,\ldots,X_N$, where $X_n \sim \text{Exponential}(\lambda)$. Let $S_N = X_1 + \cdots + X_N$ be
the sum, and let $\overline{X}_N = S_N/N$ be the sample average. Then, according to Chapter 6.4.1, $S_N$
is an Erlang random variable $S_N \sim \text{Erlang}(N, \lambda)$ with a PDF


$$f_{S_N}(x) = \frac{\lambda^N}{(N-1)!}\, x^{N-1} e^{-\lambda x}.$$


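Practice Exercise 6.10. Let $S_N \sim \text{Erlang}(N,\lambda)$ with a PDF $f_{S_N}(x)$. Show that if
$Y_N = aS_N + b$ for any constants $a > 0$ and $b$, then


$$f_{Y_N}(y) = \frac{1}{a}\, f_{S_N}\left(\frac{y-b}{a}\right).$$


Solution: This is a simple transformation of random variables:


$$F_{Y_N}(y) = \mathbb{P}[Y_N \le y] = \mathbb{P}[aS_N + b \le y] = \mathbb{P}\left[S_N \le \frac{y-b}{a}\right] = F_{S_N}\left(\frac{y-b}{a}\right).$$


Hence, using the fundamental theorem of calculus,


$$f_{Y_N}(y) = \frac{d}{dy} F_{S_N}\left(\frac{y-b}{a}\right) = \frac{1}{a}\, f_{S_N}\left(\frac{y-b}{a}\right).$$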


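We are interested in knowing the statistics of $\overline{X}_N$ and comparing it with a Gaussian.
To this end, we construct a normalized variable


$$Z_N = \frac{\overline{X}_N - \mu}{\sigma/\sqrt{N}},$$


where $\mu = \mathbb{E}[X_n] = \frac{1}{\lambda}$ and $\sigma^2 = \text{Var}[X_n] = \frac{1}{\lambda^2}$. Then


$$Z_N = \frac{S_N/N - \mu}{\sigma/\sqrt{N}} = \frac{S_N - N\mu}{\sigma\sqrt{N}} = \frac{\lambda}{\sqrt{N}}\, S_N - \sqrt{N}.$$


Using the result of the practice exercise, by mapping $a = \frac{\lambda}{\sqrt{N}}$ and $b = -\sqrt{N}$, it follows that


$$f_{Z_N}(z) = \frac{\sqrt{N}}{\lambda}\, f_{S_N}\left(\frac{z + \sqrt{N}}{\lambda/\sqrt{N}}\right).$$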

Now we compare $Z_N$ with the standard Gaussian $Z \sim \text{Gaussian}(0,1)$. According to the
Central Limit Theorem, the standard Gaussian is a good approximation to the normalized
sample average $Z_N$. To compare the two results, we conduct a numerical experiment. We
let $\lambda = 1$ and we vary $N$. We plot the PDF $f_{Z_N}(z)$ as a function of $z$, for different $N$'s, in
Figure 6.22. In addition, we plot the PDF $f_Z(z)$, which is the standard Gaussian.

The plot in Figure 6.22 shows that while the Central Limit Theorem provides a good
approximation, the approximation is only good for values that are close to the mean. For
the tails, the Gaussian approximation is not as good.


[Figure 6.22 legend: $N = 1$, $N = 10$, $N = 100$, $N = 1000$, Gaussian. The vertical axis is logarithmic.]

Figure 6.22: CLT fails at the tails. We note that $X_1,\ldots,X_N$ are i.i.d. exponential with a parameter
$\lambda = 1$. We plot the PDFs of the normalized sample average $Z_N = \frac{\overline{X}_N-\mu}{\sigma/\sqrt{N}}$ by varying $N$. We plot the PDF
of the standard Gaussian $Z \sim \text{Gaussian}(0,1)$ on the same grid. Note that the Gaussian approximation
is good for values that are close to the mean. For the tails, the Gaussian approximation is not very
accurate.
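The numerical experiment behind this figure can be sketched in a few lines. The code below evaluates the PDF of the normalized Erlang variable in the log domain and compares it with the standard Gaussian PDF at the center ($z = 0$) and in the tail ($z = 4$). The function names and the two evaluation points are illustrative assumptions.

```python
import math

def normalized_erlang_pdf(z, N):
    """PDF of Z_N = (Xbar_N - mu) / (sigma / sqrt(N)) for i.i.d. Exponential samples.

    Uses f_{Z_N}(z) = (1/a) f_{S_N}((z - b)/a) with a = lam/sqrt(N), b = -sqrt(N).
    The rate lam cancels, so the result is the same for every lam.
    """
    x = N + math.sqrt(N) * z                 # value of lam * S_N that maps to z
    if x <= 0.0:
        return 0.0                           # S_N is non-negative
    log_pdf = 0.5 * math.log(N) + (N - 1) * math.log(x) - x - math.lgamma(N)
    return math.exp(log_pdf)

def std_gaussian_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

for N in (1, 10, 100, 1000):
    center = normalized_erlang_pdf(0.0, N) / std_gaussian_pdf(0.0)
    tail = normalized_erlang_pdf(4.0, N) / std_gaussian_pdf(4.0)
    print(f"N={N:5d}  ratio at z=0: {center:.3f}   ratio at z=4: {tail:.3f}")
```

A ratio near 1 means the Gaussian is a good approximation at that point. The ratio at the center approaches 1 quickly, whereas the ratio in the tail stays noticeably above 1 even for large $N$.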


The limitation of the Central Limit Theorem is attributable to the fact that Gaussian 
is a second-order approximation. If a random variable has a very large third moment, the 
second-order approximation may not be sufficient. In this case, we need a much larger N to 
drive the third moment to a small value and make the Gaussian approximation valid. 


When will the Central Limit Theorem fail?
• The Central Limit Theorem fails when $N$ is small.


• The Central Limit Theorem fails if the third moment is large. As an extreme
case, a Cauchy random variable does not have a finite third moment (in fact, it does
not even have a finite mean or variance). The Central Limit Theorem is not valid
for this case.


• The Central Limit Theorem can only approximate the probability for input values
near the mean. It does not approximate the tails, for which we need to use
Chernoff’s bound.
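The Cauchy case can be illustrated with a small simulation. The sketch below measures the spread of the sample average by its interquartile range (IQR), since the variance of a Cauchy does not exist. The seed, the number of trials, and the helper names are arbitrary illustrative choices.

```python
import math
import random

def cauchy_sample(rng):
    """Standard Cauchy draw via the inverse-CDF method."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr_of_sample_average(draw, N, trials, rng):
    """Interquartile range of the sample average over repeated experiments."""
    averages = sorted(sum(draw(rng) for _ in range(N)) / N for _ in range(trials))
    return averages[(3 * trials) // 4] - averages[trials // 4]

rng = random.Random(0)                       # fixed seed so the run is repeatable
gaussian_sample = lambda r: r.gauss(0.0, 1.0)

for N in (1, 10, 100):
    iqr_cauchy = iqr_of_sample_average(cauchy_sample, N, trials=2000, rng=rng)
    iqr_gauss = iqr_of_sample_average(gaussian_sample, N, trials=2000, rng=rng)
    print(f"N={N:4d}  IQR (Cauchy average)={iqr_cauchy:.3f}  IQR (Gaussian average)={iqr_gauss:.3f}")
```

For Gaussian samples the spread of the average shrinks like $1/\sqrt{N}$. For Cauchy samples it does not shrink at all, because the average of i.i.d. standard Cauchy variables is again a standard Cauchy variable.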
6.5 Summary


Why do we need to study the sample average? Because it is the summary of the dataset. In
machine learning, one of the most frequently asked questions is about the number of training
samples required to train a model. The answer can be found by analyzing the average number
of successes and failures as the number of training samples grows. For example, if we define
$f$ as the classifier that takes a data point $x_n$ and predicts a label $f(x_n)$, we hope that it
will match with the true label $y_n$. If we define an error


$$E_n = \begin{cases} 0, & f(x_n) = y_n \quad \text{correct classification},\\ 1, & f(x_n) \ne y_n \quad \text{incorrect classification},\end{cases}$$


then $E_n$ is a Bernoulli random variable, and the total loss $\mathcal{E} = \frac{1}{N}\sum_{n=1}^{N} E_n$ will be the
training loss. But what is $\frac{1}{N}\sum_{n=1}^{N} E_n$? It is exactly the sample average of $E_n$. Therefore, by
analyzing the sample average $\mathcal{E}$ we will learn something about the generalization capability
of our model.
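As a small illustration of the training loss being a sample average, the sketch below computes the average of the error indicators and attaches a CLT-based 95% half-width. It assumes the convention that the indicator equals 1 for an incorrect classification and 0 otherwise, and it plugs the estimated error rate into the Bernoulli variance; both are assumptions of this sketch. The labels and predictions are made up.

```python
import math

def training_error_summary(predictions, labels):
    """Sample average of the error indicators E_n, plus a CLT-based 95% half-width.

    Assumed convention: E_n = 1 for an incorrect classification, 0 otherwise.
    The half-width plugs the estimated error rate into the Bernoulli variance
    p(1 - p); that plug-in step is an assumption of this sketch.
    """
    errors = [1 if pred != label else 0 for pred, label in zip(predictions, labels)]
    N = len(errors)
    avg_error = sum(errors) / N
    half_width = 1.96 * math.sqrt(avg_error * (1.0 - avg_error) / N)
    return avg_error, half_width

# Hypothetical labels and predictions, made up for illustration only.
labels      = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0] * 20
predictions = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0] * 20
avg_error, half_width = training_error_summary(predictions, labels)
print(f"training error = {avg_error:.3f} +/- {half_width:.3f}")
```

The half-width shrinks like $1/\sqrt{N}$, which is one way to reason about how many samples are needed for a given precision.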


How should we study the sample average? By understanding the law of large numbers 
and the Central Limit Theorem, as we have seen in this chapter. 


• Law of large numbers: $\overline{X}_N$ converges to the true mean $\mu$ as $N$ grows.


• Central Limit Theorem: The CDF of $\overline{X}_N$ can be approximated by the CDF of a
Gaussian, as $N$ grows.


Performance guarantee? The other topic we discussed in this chapter is the concept
of convergence type. There are essentially four types of convergence, ranked in the order of
restrictions.


• Deterministic convergence: A sequence of deterministic numbers converges to
another deterministic number. For example, the sequence $1, \frac{1}{2}, \frac{1}{3}, \frac{1}{4}, \ldots$ converges
to 0 deterministically. There is nothing random about it.


• Almost sure convergence: Randomness exists, and there is a probabilistic
convergence. Almost sure convergence means that there is zero probability of failure
after a finite number of failures.


• Convergence in probability: The sequence of probability values converges, i.e.,
the chance of failure is going to zero. However, you can still fail even if your $N$
is large.


• Convergence in distribution: The probability values can be approximated by
the CDF of a Gaussian.
6.7 Problems


Exercise 1. (VIDEO SOLUTION)
Let $X, Y, Z$ be three independent random variables:


$$X \sim \text{Bernoulli}(p), \quad Y \sim \text{Exponential}(\alpha), \quad Z \sim \text{Poisson}(\lambda).$$


Find the moment-generating function for the following random variables.


(a) $U = Y + Z$

(b) $U = 2Z + 3$

(c) $U = XY$

(d) $U = 2XY + (1-X)Z$


Exercise 2. (VIDEO SOLUTION) 
Two random variables X and Y have the joint PMF 


Pe a) a ain 


Let $Z = X + Y$. Find the moment-generating function $M_Z(s)$ and the PMF of $Z$.


Exercise 3. (VIDEO SOLUTION) 


Let $X_0, X_1, \ldots$ be a sequence of independent random variables with PDF

$$f_{X_k}(x) = \frac{a_k}{\pi(a_k^2 + x^2)}, \qquad a_k = \frac{1}{2^{k+1}},$$

for $k = 0, 1, \ldots$. Find the PDF of $Y$, where


Hint: You may find the characteristic function useful. 


Exercise 4. 
The random variables X and Y are independent and have PDFs 


fx(@) - +i r= 0, and fy (y) = ie y> 0, 


0, x<0, 


; m=0,1,2,..., n>—m. 




Find the PDF of Z = X + Y. (Hint: Use the characteristic function and the moment- 
generating function.) 


Exercise 5. 
A discrete random variable $X$ has a PMF


$$p_X(k) = \frac{1}{2^k}, \qquad k = 1, 2, \ldots.$$


Find the characteristic function $\Phi_X(j\omega)$.


Exercise 6. 
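Let $T_1, T_2, \ldots$ be i.i.d. random variables with PDF


$$f_{T_k}(t) = \begin{cases} \lambda e^{-\lambda t}, & t \ge 0,\\ 0, & t < 0,\end{cases}$$


for $k = 1, 2, 3, \ldots$. Let $S_n = \sum_{k=1}^{n} T_k$. Find the PDF of $S_n$.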


Exercise 7. (VIDEO SOLUTION) 
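In this exercise we will prove a variant of Chebyshev when the variance $\sigma^2$ is unknown but
$X$ is bounded between $a \le X \le b$.


(a) Let $\gamma \in \mathbb{R}$. Find a $\gamma$ that minimizes $\mathbb{E}[(X-\gamma)^2]$. Hence, show that $\mathbb{E}[(X-\gamma)^2] \ge \text{Var}[X]$
for any $\gamma$.


(b) Let $\gamma = (a+b)/2$. Show that


$$\mathbb{E}[(X-\gamma)^2] = \mathbb{E}[(X-a)(X-b)] + \frac{(b-a)^2}{4}.$$


(c) From (a) and (b), show that $\text{Var}[X] \le \frac{(b-a)^2}{4}$.

(d) Show that for any $\epsilon > 0$,


$$\mathbb{P}[|X - \mu| > \epsilon] \le \frac{(b-a)^2}{4\epsilon^2}.$$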


Exercise 8. 
The random variables $X$ and $Y$ are independent with PDFs


$$f_X(x) = \frac{1}{\pi(1+x^2)}, \quad \text{and} \quad f_Y(y) = \frac{1}{\pi(1+y^2)},$$

respectively. Find the PDF of $Z = X - Y$. (Hint: Use the characteristic function.)


Exercise 9. 
A random variable X has the characteristic function 


®x (jw) = eIe/ 1-3) 
Find the mean and variance of X.


Exercise 10. 
Show that for any random variables $X$ and $Y$,


$$\mathbb{P}[|X - Y| > \epsilon] \le \frac{1}{\epsilon^2}\,\mathbb{E}[(X-Y)^2].$$


Exercise 11. 

Let $X$ be an exponential random variable with a parameter $\lambda$. Let $\mu = \mathbb{E}[X]$ and $\sigma^2 =
\text{Var}[X]$. Compute $\mathbb{P}[|X - \mu| \ge k\sigma]$ for any $k > 1$. Compare this to the bound obtained by
Chebyshev’s inequality.


Exercise 12. 
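Let $X_1,\ldots,X_N$ be i.i.d. Bernoulli with a parameter $p$. Let $\alpha > 0$ and define


$$\epsilon = \sqrt{\frac{1}{2N}\log\frac{2}{\alpha}}.$$


Let $\overline{X}_N = \frac{1}{N}\sum_{n=1}^{N} X_n$. Define an interval
$$\mathcal{I} = \left[\overline{X}_N - \epsilon,\; \overline{X}_N + \epsilon\right].$$
Use Hoeffding’s inequality to show that


$$\mathbb{P}[\mathcal{I} \text{ contains } p] \ge 1 - \alpha.$$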


Exercise 13. 
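Let $Z \sim \text{Gaussian}(0,1)$. Prove that for any $\epsilon > 0$,


$$\mathbb{P}[|Z| > \epsilon] \le \sqrt{\frac{2}{\pi}}\, \frac{e^{-\frac{\epsilon^2}{2}}}{\epsilon}.$$


Hint: Note that $\epsilon\,\mathbb{P}[|Z| > \epsilon] = 2\epsilon\,\mathbb{P}[Z > \epsilon]$, and then follow the procedure we used to prove
Markov’s inequality.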


Exercise 14. 


(a) Give a non-negative random variable X > 0 such that Markov’s inequality is met with 
equality. Hint: Consider a discrete random variable. 


(b) Give a random variable X such that Chebyshev’s inequality is met with equality. 


Exercise 15. 
Consider a random variable $X$ such that


$$\mathbb{E}\left[e^{sX}\right] \le e^{\frac{s^2\sigma^2}{2}}.$$


(a) Show that for any $t$,
$$\mathbb{P}[X \ge t] \le \exp\left\{-\frac{t^2}{2\sigma^2}\right\}.$$
Hint: Use Chernoff’s bound.

(b) Show that


$$\mathbb{E}[X^2] \le 4\sigma^2.$$


Hint: First prove that $\mathbb{E}[X^2] = \int_0^{\infty} \mathbb{P}[X^2 \ge t]\,dt$. Then use part (a) above.


Exercise 16. 
Let $X_1,\ldots,X_N$ be i.i.d. uniform random variables distributed over $[0,1]$. Suppose $Y_1,\ldots,Y_N$
are defined as follows.


(a) $Y_n = X_n/n$
(b) $Y_n = (X_n)^n$
(c) $Y_n = \max(X_1,\ldots,X_n)$
(d) $Y_n = \min(X_1,\ldots,X_n)$


For (a), (b), (c), and (d), show that $Y_n$ converges in probability to some limit. Identify the
limit in each case.


Exercise 17. 
Let $\lambda_n = \frac{1}{n}$ for $n = 1, 2, \ldots$. Let $X_n \sim \text{Poisson}(\lambda_n)$. Show that $X_n$ converges in probability
to 0.


Exercise 18. 


Let Y,, Y2,... be a sequence of random variables such that 
Y= 0, with probability 1 — 1, 
ee is with probability 1. 


Does Y,, converge in probability to 0? 


Exercise 19. (VIDEO SOLUTION) 
A Laplace random variable has a PDF


$$f_X(x) = \frac{\lambda}{2}\, e^{-\lambda|x|}, \qquad \lambda > 0,$$
and the variance is $\text{Var}[X] = \frac{2}{\lambda^2}$. Let $X_1,\ldots,X_{500}$ be a sequence of i.i.d. Laplace random
variables. Let


$$M_{500} = \frac{X_1 + \cdots + X_{500}}{500}.$$


(a) Find $\mathbb{E}[X]$. Express your answer in terms of $\lambda$.

(b) Let $\lambda = 10$. Using Chebyshev’s inequality, find a lower bound of
$\mathbb{P}[-0.1 \le M_{500} \le 0.1]$.

(c) Let $\lambda = 10$. Using the Central Limit Theorem, find the probability
$\mathbb{P}[-0.1 \le M_{500} \le 0.1]$.


You may leave your answer in terms of the $\Phi(\cdot)$ function.


Exercise 20. (VIDEO SOLUTION) 
Let $X_1,\ldots,X_N$ be a sequence of i.i.d. random variables such that $X_n = \pm 1$ with equal
probability. Let


$$Y_N = \frac{1}{\sqrt{N}}\sum_{n=1}^{N} X_n.$$


Prove the Central Limit Theorem for this particular sequence of random variables by showing
that


(a) $\mathbb{E}[Y_N] = 0$, $\text{Var}[Y_N] = 1$.


(b) The moment-generating function of $Y_N$ is $M_{Y_N}(s) \to e^{\frac{s^2}{2}}$ as $N \to \infty$.


Exercise 21. (VIDEO SOLUTION) 
Let X1,...,Xy be a sequence of i.i.d. random variables with mean and variance 


|<, | =e and? Var[Xep=07) = Lyi: 


The distribution of $X_n$ is unknown. Let


Use the Central Limit Theorem to estimate the probability $\mathbb{P}[M_N > 2\mu]$.
Chapter 7


Regression


Starting with this chapter, we will discuss several combat skills: techniques that we use
to do the actual data analysis. The theme of this topic is learning and inference, which are
both at the core of modern data science. The word “learning” can be broadly interpreted
as seeking the best model to explain the data, and the word “inference” refers to prediction
and recovery. Here, prediction means that we use the observed data to forecast or generalize
to unseen situations, whereas recovery means that we try to restore the missing data in our
current observations. In this chapter we will learn regression, one of the most widely used
learning and inference techniques.

Regression is a process for finding the relationship between the inputs and the outputs.
In a regression problem, we consider a set of input data $\{x_1,\ldots,x_N\}$ and a set of output
data $\{y_1,\ldots,y_N\}$. We call the set of these input-output pairs $\mathcal{D} \stackrel{\text{def}}{=} \{(x_1,y_1),\ldots,(x_N,y_N)\}$
the training data. The true relationship between an $x_n$ and a $y_n$ is unknown. We do not
know, you do not know, only God knows. We denote this unknown relationship as a mapping
$f(\cdot)$ that takes $x_n$ and maps it to $y_n$,


$$y_n = f(x_n),$$


as illustrated in Figure 7.1.


Figure 7.1: A regression problem is about finding the best approximation to the input-output relationship
of the data.


Since we do not know $f(\cdot)$, finding it from a finite number of data points $\mathcal{D} =
\{(x_1,y_1),\ldots,(x_N,y_N)\}$ is infeasible: there are infinitely many ways we can make $y_n =
f(x_n)$ for every $n = 1,\ldots,N$. The idea of regression is to add a structure to the problem.
Instead of looking for $f(\cdot)$, we find a proxy $g_{\theta}(\cdot)$. This proxy $g_{\theta}(\cdot)$ takes a certain parametric
form. For example, we can postulate that $(x_n, y_n)$ has a linear relationship, and so


$$g_{\theta}(x_n) = \underbrace{\theta_1}_{\text{parameter}} x_n + \underbrace{\theta_0}_{\text{parameter}}, \qquad n = 1,\ldots,N.$$

This equation is a straight line with a slope $\theta_1$ and a y-intercept $\theta_0$. We call $\theta = [\theta_1, \theta_0]$
the parameter of the model. To emphasize that the function we are using here is
parameterized by $\theta$, we denote the function by $g_{\theta}(\cdot)$.

Of course, any model we choose is our guess. It will never be the true model. There is
always a difference between what our model tells us and what we have observed. We denote
this “difference” or “error” by $e_n$ and define it as:


$$e_n = y_n - g_{\theta}(x_n), \qquad n = 1,\ldots,N.$$


The purpose of regression is to find the best $\theta$ such that the error is minimized. For example,
consider a minimization of the sum-square error:


$$\widehat{\theta} = \underset{\theta}{\text{argmin}}\; \underbrace{\sum_{n=1}^{N} \left(y_n - g_{\theta}(x_n)\right)^2}_{\text{training loss } \mathcal{E}_{\text{train}}(\theta)}.$$


The sum of the squared error is just one of the many possible ways we can define the training
loss $\mathcal{E}_{\text{train}}(\theta)$. We will discuss different ways to define the training loss in this chapter, but the
point should be evident. For a given dataset $\mathcal{D} = \{(x_1,y_1),\ldots,(x_N,y_N)\}$, regression tries
to find a function $g_{\theta}(\cdot)$ such that the training loss is minimized. The optimization variable
is the parameter $\theta$. If the function $g_{\theta}(\cdot)$ is a linear function in $\theta$, we call the regression a
linear regression.
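For the straight-line model with a slope and an intercept, the sum-square error has a well-known closed-form minimizer, stated here without derivation. The sketch below is a minimal illustration with made-up data; the function names `fit_line` and `training_loss` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope theta1 and intercept theta0 for g(x) = theta1 * x + theta0."""
    N = len(xs)
    x_mean, y_mean = sum(xs) / N, sum(ys) / N
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = y_mean - theta1 * x_mean
    return theta1, theta0

def training_loss(theta1, theta0, xs, ys):
    """Sum of squared errors e_n = y_n - g(x_n)."""
    return sum((y - (theta1 * x + theta0)) ** 2 for x, y in zip(xs, ys))

# Made-up data scattered around y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
theta1, theta0 = fit_line(xs, ys)
print(theta1, theta0, training_loss(theta1, theta0, xs, ys))
```

Perturbing either parameter away from the fitted value can only increase the training loss, which is a quick sanity check that the fitted pair is the minimizer.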


[Figure 7.2 diagram labels: you decide the model; prediction; output $y_n$; training loss $\mathcal{E}_{\text{train}}(\theta)$; optimization.]

Figure 7.2: A regression problem involves several steps: picking a model $g_{\theta}$, defining the training loss
$\mathcal{E}_{\text{train}}(\theta)$, and solving the optimization to update $\theta$.


A summary of the regression process is shown in Figure 7.2. Given the training data
$\mathcal{D} = \{(x_1,y_1),\ldots,(x_N,y_N)\}$, the user picks a model $g_{\theta}(\cdot)$ to make a prediction. We compare
the predicted value $g_{\theta}(x_n)$ with the observed value $y_n$, and compute the training loss
$\mathcal{E}_{\text{train}}(\theta)$. The training loss $\mathcal{E}_{\text{train}}(\theta)$ is a function of the model parameter $\theta$. Different model
parameters $\theta$ give different training losses. We solve an optimization problem to find the best
model parameter. In practice, we often iterate the process a few times until the training
loss settles down.

What is regression?


Given the data points $(x_1,y_1),\ldots,(x_N,y_N)$, regression is the process of finding
the parameter $\theta$ of a function $g_{\theta}(\cdot)$ such that the training loss is minimized:


$$\widehat{\theta} = \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\; \underbrace{\sum_{n=1}^{N} \mathcal{L}\left(y_n, g_{\theta}(x_n)\right)}_{\text{training loss } \mathcal{E}_{\text{train}}(\theta)}, \qquad (7.1)$$


where $\mathcal{L}(\cdot,\cdot)$ is the loss between a pair of true observation $y_n$ and the prediction $g_{\theta}(x_n)$.
One common choice of $\mathcal{L}(\cdot,\cdot)$ is $\mathcal{L}(g_{\theta}(x_n), y_n) = (g_{\theta}(x_n) - y_n)^2$.


Example 1. Fitting the data


Suppose we have a set of data points $(x_1,y_1), (x_2,y_2), \ldots, (x_N,y_N)$, where $x_n$’s are the
inputs and $y_n$’s are the outputs. These pairs of data points can be plotted in a scatter plot,
as shown in Figure 7.3. We want to find the curve that best fits the data.

To solve this problem, we first need to choose a model, for example


$$g_{\theta}(x_n) = \theta_0 + \theta_1 x_n + \theta_2 x_n^2 + \theta_3 x_n^3 + \theta_4 x_n^4.$$

We call the coefficients $\theta = [\theta_0, \theta_1, \theta_2, \theta_3, \theta_4]$ the regression coefficients. They can be found
by solving the optimization problem


$$(\widehat{\theta}_0,\ldots,\widehat{\theta}_4) = \underset{\theta_0,\ldots,\theta_4}{\text{argmin}} \sum_{n=1}^{N} \left(y_n - \left(\theta_0 + \theta_1 x_n + \theta_2 x_n^2 + \theta_3 x_n^3 + \theta_4 x_n^4\right)\right)^2.$$


[Figure 7.3 legend: data (circles), fitted curve (line). The horizontal axis runs from $-1$ to $1$.]

Figure 7.3: Regression can be used to fit the dataset using curves. In this example, we use a fourth-order
polynomial $g_{\theta}(x) = \sum_{p=0}^{4} \theta_p x^p$ to fit a 50-point dataset.


This optimization asks for the best $\theta = [\theta_0,\ldots,\theta_4]^T$ such that the training loss is
minimized. Solving the minimization problem would require some effort, but if we imagine
that we have solved it we can find the best curve, which is $g_{\theta}(x) = \sum_{p=0}^{4} \theta_p x^p$ with the
optimal $\theta$ plugged in. The red curve in Figure 7.3 shows an example in which we have used
a fourth-order polynomial to fit a dataset comprising 50 data points. We will learn how to
solve the problem in this chapter.
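As a preview, one standard way to solve this fourth-order fit is through the normal equations of the least-squares problem. The sketch below uses only the Python standard library and a small Gaussian-elimination helper. The synthetic, noise-free data and all function names are assumptions for illustration; this is not necessarily the method used to produce the figure.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_polynomial(xs, ys, order=4):
    """Minimize sum_n (y_n - sum_p theta_p x_n^p)^2 via the normal equations."""
    d = order + 1
    # A[i][j] = sum_n x_n^(i+j),  b[i] = sum_n y_n x_n^i
    A = [[sum(x ** (i + j) for x in xs) for j in range(d)] for i in range(d)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(d)]
    return solve_linear_system(A, b)

# Synthetic, noise-free check: 50 inputs in [-1, 1] and a known quartic.
true_theta = [0.5, -1.0, 2.0, 0.0, -3.0]
xs = [-1.0 + 2.0 * n / 49 for n in range(50)]
ys = [sum(t * x ** p for p, t in enumerate(true_theta)) for x in xs]
theta_hat = fit_polynomial(xs, ys)
print([round(t, 6) for t in theta_hat])
```

With noise-free data the recovered coefficients should match the ones used to generate the data. With noisy data the same code returns the coefficients that minimize the sum of squared errors.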
Example 2. Predicting the stock market


Imagine that you have bought some shares in the stock market. You have looked at the past
data, and you want to predict the price of the shares over the next few days. How would
you do it besides just eyeballing the data?

First, you would plot the data points on a graph. Mathematically, we can denote these
data points as $\{x_1, x_2, \ldots, x_N\}$, where the indices $n = 1, 2, \ldots, N$ can be treated as time
stamps. We assume a simple model to describe the relationship between the $x_n$’s, say


$$x_n \approx a x_{n-1} + b x_{n-2},$$


for some parameters $\theta = (a, b)$.¹ This model assumes that the current value $x_n$ can be
approximated by a linear combination of two previous values $x_{n-1}$ and $x_{n-2}$. Therefore, if
we have $x_1$ and $x_2$ we should be able to predict $x_3$, and if we have $x_2$ and $x_3$ we should be
able to predict $x_4$, etc. The magic of this prediction comes from the parameters $a$ and $b$. If
we know $a$ and $b$, the prediction can be done by simply plugging in the numbers.

The regression problem here is to estimate the parameters $a$ and $b$ from the data. Since
we are given a set of training data $\{x_1, x_2, \ldots, x_N\}$, we can check whether our predicted
value $\widehat{x}_3$ is close to the true $x_3$, and whether our predicted value $\widehat{x}_4$ is close to the true $x_4$,
etc. This leads to the optimization


$$(\widehat{a}, \widehat{b}) = \underset{a,b}{\text{argmin}} \sum_{n=1}^{N} \Big(x_n - \underbrace{(a x_{n-1} + b x_{n-2})}_{=\text{prediction}}\Big)^2,$$


where we use initial conditions that $x_0 = x_{-1} = 0$. The optimization problem requires us
to minimize the disparity between $x_n$ and the predicted value $a x_{n-1} + b x_{n-2}$, for all $n$.
By finding the $(a, b)$ that minimizes this objective function, we will accomplish our goal of
estimating the best $(a, b)$.

Figure 7.4 shows an example of predicting a random process using the above model.
If the parameters $a$ and $b$ are properly determined, we will obtain a reasonably well-fitted
curve to the data. A simple extrapolation to the future timestamp would suffice for the
forecast task.
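The two-parameter optimization above reduces to a 2-by-2 system of normal equations, which can be solved directly. The sketch below is a minimal illustration using the zero initial conditions stated in the text. The synthetic series, the noise level, and the function names are assumptions made for the example.

```python
import random

def fit_ar2(x):
    """Least-squares (a, b) for x_n ~ a x_{n-1} + b x_{n-2}, zero initial conditions."""
    padded = [0.0, 0.0] + list(x)            # x_{-1} = x_0 = 0, as in the text
    s11 = s12 = s22 = r1 = r2 = 0.0
    for n in range(2, len(padded)):
        cur, prev1, prev2 = padded[n], padded[n - 1], padded[n - 2]
        s11 += prev1 * prev1
        s12 += prev1 * prev2
        s22 += prev2 * prev2
        r1 += cur * prev1
        r2 += cur * prev2
    det = s11 * s22 - s12 * s12              # 2x2 normal equations, solved by Cramer's rule
    a = (r1 * s22 - r2 * s12) / det
    b = (r2 * s11 - r1 * s12) / det
    return a, b

def predict_next(x, a, b):
    """One-step forecast from the two most recent samples."""
    return a * x[-1] + b * x[-2]

# Synthetic series from a known recursion plus small noise (illustrative only).
rng = random.Random(0)
series = [1.0, 0.8]
for _ in range(2000):
    series.append(1.5 * series[-1] - 0.7 * series[-2] + rng.gauss(0.0, 0.1))
a_hat, b_hat = fit_ar2(series)
print(a_hat, b_hat, predict_next(series, a_hat, b_hat))
```

On this synthetic series the estimates should land close to the generating values 1.5 and -0.7. On real price data there is no such guarantee, since the model itself may be wrong.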


[Figure 7.4 legend: data, best fit, candidate. The horizontal axis runs from 0 to 1.]

Figure 7.4: An autoregression model aims at learning the model parameters based on the previous
samples. This example illustrates fitting the data using the model $x_n = a x_{n-1} + b x_{n-2}$, for $n = 1,\ldots,N$.


¹Caution: If you lose money in the stock market by following this naive model, please do not cry. This
model is greatly oversimplified and probably wrong.


Plan for this chapter


What are the key ingredients of regression?


• Learning: Formulate the regression problem as an optimization problem, and
solve it by finding the best parameters.


• Inference: Use the estimated parameters and models to predict the unseen data
points.


Regression is too broad a topic to be covered adequately in a single chapter. Accordingly,
we will present a few principles and a few practical algorithmic techniques that are
broadly applicable to many (definitely not all) regression tasks. These include the following.


• The principle of regression (Section 7.1). We explain the formulation of a regression
problem via optimization. There are a few steps involved in developing this concept.
First, we will exclusively focus on linear models because these models are easier to
analyze than nonlinear models but are still rich enough for many practical problems.
We will discuss how to solve the linear regression problem and some applications of
the solutions. We then address the issue of outliers using a concept called the robust
linear regression.


• Overfitting (Section 7.2). The biggest practical challenge of regression is overfitting.
Overfitting occurs when a model fits too closely to the training samples so that it
fails to generalize. We will delve deeply into the roots of overfitting and show that
overfitting depends on three factors: the number of training samples $N$, the model
complexity $d$, and the magnitude of noise $\sigma^2$.


• Bias-variance trade-off (Section 7.3). We will present one of the most fundamental
results in learning theory, known as the bias-variance trade-off. It applies to all regression
problems, not just to linear models. Understanding this trade-off will help you
understand the fundamental limits of your problem so that you know what to expect
from the model.


• Regularization (Section 7.4). In this section we discuss a technique for combatting
overfitting known as regularization. Regularization is carried out by adding an extra
term to the regression objective function. By solving the modified optimization, the
regression solution is improved in two ways: (i) regularization makes the regression
solution less sensitive to noise perturbations, and (ii) it alleviates the fitting difficulty
when we have only a few training samples. We will discuss two regularization strategies:
the ridge regression and the LASSO regression.


Much of this chapter deals with optimization. If this is your first time reading this
book, we encourage you to have a reference book on linear algebra at hand.
7.1 Principles of Regression


We start by recalling our discussion in the introduction. The purpose of regression can be
summarized in a simple statement:


Given the data points $(x_1,y_1),\ldots,(x_N,y_N)$, find the parameter $\theta$ of a function $g_{\theta}(\cdot)$
such that the training loss is minimized:


$$\widehat{\theta} = \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\; \underbrace{\sum_{n=1}^{N} \mathcal{L}\left(y_n, g_{\theta}(x_n)\right)}_{\text{training loss } \mathcal{E}_{\text{train}}(\theta)}, \qquad (7.2)$$


where $\mathcal{L}(\cdot,\cdot)$ is the loss between a pair of true observation $y_n$ and the prediction $g_{\theta}(x_n)$.


When the context makes it clear, we will drop the subscript $\theta$ in $g_{\theta}(\cdot)$ with the understanding
that the function $g(\cdot)$ is parameterized by $\theta$.

As you can see, regression finds a function $g(\cdot)$ that best approximates the input-output
relationship between $x_n$ and $y_n$. There are two choices we need to make when formulating
a regression problem:


• Function $g(\cdot)$: What is the family of functions we want to use? This could be a line, a
polynomial, or a set of basis functions. If it is a polynomial, what is its order? We need
to make all these decisions before running the regression. A poor choice of function
family can lead to a poor regression result.


• Loss $\mathcal{L}(\cdot,\cdot)$: How do we measure the closeness between $y_n$ and $g(x_n)$? Are we measuring
in terms of the squared error $(y_n - g(x_n))^2$, or the absolute difference $|y_n - g(x_n)|$,
or something else? Again, a poor choice of distance function can create a false sense
of closeness because you might be optimizing for a wrong objective.
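To see how the choice of loss changes the answer, consider the simplest possible model, a constant prediction $g(x) = c$. This simplification is an assumption made only for this illustration. The sketch below finds the best constant under the squared error and under the absolute difference by a plain grid search over made-up outputs that contain one outlier.

```python
import statistics

def best_constant(ys, loss):
    """Grid search for the constant prediction c that minimizes sum_n loss(y_n, c)."""
    lo, hi = min(ys), max(ys)
    grid = [lo + (hi - lo) * i / 10_000 for i in range(10_001)]
    return min(grid, key=lambda c: sum(loss(y, c) for y in ys))

squared = lambda y, g: (y - g) ** 2
absolute = lambda y, g: abs(y - g)

# Made-up outputs with one outlier (illustrative only).
ys = [1.0, 1.2, 0.9, 1.1, 10.0]
c_squared = best_constant(ys, squared)
c_absolute = best_constant(ys, absolute)
print(c_squared, statistics.mean(ys))     # squared error: pulled toward the outlier
print(c_absolute, statistics.median(ys))  # absolute error: stays with the bulk of the data
```

Under the squared error the best constant sits near the sample mean, which the outlier drags upward. Under the absolute difference it sits near the sample median, which the outlier barely affects.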


Before we delve into the details, we need to discuss briefly the connection between 
regression and probability. A regression problem can be solved without knowing probability, 
so why is regression discussed in a book on probability? 

This question is related to how much we know about the statistical model and what 
kind of optimality we are seeking. A full answer requires some understanding of maximum 
likelihood estimation and maximum a posteriori estimation, which will be explained in 
Chapter 8. As a quick preview of our results, we summarize the key ideas below: 


How is regression related to probability? 


• If you know the statistical relationship between $x_n$ and $y_n$, then we can construct 
a regression problem that maximizes the likelihood of the underlying distribution. 
Such a regression solution is optimal with respect to the likelihood. 


• We can construct a regression problem that can minimize the expectation of the 
squared error. This regression solution is mean-squared optimal. 


• If you are a Bayesian and you know the prior distribution of $x_n$, then we can 
construct a regression problem that maximizes the posterior distribution. The 
solution to this regression problem is Bayesian optimal. 


• If you know nothing about the statistics of $x_n$ and $y_n$, you can still run the 
regression and get something, and this “something” can be very useful. However, 
you cannot claim statistical optimality of this “something”. 


See Chapter 8 for additional discussion. 


It is important to understand that a regression problem is at the intersection of 
optimization and statistics. The need for optimization is clear because we need to minimize 
the error. The statistical need is to generalize to unknown data. If there is no statistical 
relationship between $x_n$ and $y_n$ (for all $n$), whatever model we obtain from the regression 
will only work for the $N$ training samples. The model will not generalize because knowing 
$x_n$ will not help us know $y_n$. In other words, if there is no statistical relationship between 
$x_n$ and $y_n$, you can fit perfectly to the training data but you will fail miserably to fit the 
testing data. 


7.1.1 Intuition: How to fit a straight line? 


In this subsection we want to give you the basic idea of how regression is formulated. To 
keep things simple, we will discuss how to fit data using a straight line. 

Consider a collection of data points $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where the $x_n$’s are the 
inputs and the $y_n$’s are the observations, for example, in the table below. 


  n      x_n      y_n 
  1      0.6700   3.0237 
  2      0.3474   2.3937 
  3      0.6695   3.5548 
  ...    ...      ... 
  N-1    0.2953   2.6396 
  N      0.6804   3.2536 


Let us consider the linear regression problem. The goal of linear regression is to find 
the straight line that best fits the dataset. All straight lines on a 2D graph are plots of the 
equation 

$$g(x) = ax + b,$$


where $a$ is the slope of the line and $b$ is the y-intercept of the line. We denote this line 
by $g(\cdot)$. Note that this function $g$ is characterized by two parameters $(a,b)$ because once 
$(a,b)$ are known the line is determined. If we change $(a,b)$, the line will change as well. 
Therefore, by finding the best line we are essentially searching for the best $(a,b)$ such that 
the training error is minimized. 

The pictorial meaning of linear regression can easily be seen in Figure 7.5, which 
shows $N = 50$ data points drawn from some latent distribution. Given these 50 data 
points, we construct several possible candidates for the regression model. These candidates 
are characterized by the parameters $(a,b)$. For example, the parameters $(a,b) = (1,2)$ and 
$(a,b) = (-2,3)$ represent two different straight lines in the candidate pool. The goal of the 
regression is to find the best line from these candidates. Note that since we limit ourselves 
to straight lines, the candidate set will not include polynomials or trigonometric functions. 
These functions are outside the family we are considering. 


[Figure 7.5: scatter plot of the data points (circles) together with the best-fit line and several candidate lines; the horizontal axis runs from 0 to 1.] 


Figure 7.5: The objective of least squares fitting (or linear regression) is to find a line that best fits the 
dataset. 


Given these candidate functions, we need to measure the training loss. This can 
be defined in multiple ways, such as 


• Sum-squared loss $\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \sum_{n=1}^{N} (y_n - g(x_n))^2$. 


• Sum-absolute loss $\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \sum_{n=1}^{N} |y_n - g(x_n)|$. 


• Cross-entropy loss $\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = -\sum_{n=1}^{N} \big( y_n \log g(x_n) + (1 - y_n) \log(1 - g(x_n)) \big)$. 


• Perceptual loss $\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \sum_{n=1}^{N} \max(-y_n g(x_n), 0)$, when $y_n$ and $g(x_n)$ are binary, 
taking values $\pm 1$. This is a reasonable training error because if $y_n$ matches with $g(x_n)$, 
then $y_n g(x_n) = 1$ and so $\max(-y_n g(x_n), 0) = 0$. But if $y_n$ does not match with $g(x_n)$, 
then $y_n g(x_n) = -1$ and hence $\max(-y_n g(x_n), 0) = 1$. Thus, the loss captures the sum 
of all the mismatched pairs. 

Choosing the loss function is problem-specific. It is also where probability enters the picture 
because, without any knowledge about the distributions of $x_n$ and $y_n$, there is no way to 
choose the best training loss. You can still pick one, as we will do, but it will not come with 
any probabilistic guarantees. 
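To make the role of the loss concrete, the following is a minimal sketch that evaluates a few of the training losses listed above for candidate straight lines. The toy data values and the function names (`g`, `sum_squared_loss`, `sum_absolute_loss`, `mismatch_loss`) are illustrative assumptions and do not come from the book's experiments.

```python
import numpy as np

# Toy data (illustrative values, assumed for this sketch).
x = np.array([0.1, 0.4, 0.7, 0.9])
y = np.array([1.5, 2.4, 3.0, 3.6])

def g(x, a, b):
    """Candidate straight line g(x) = a*x + b."""
    return a * x + b

def sum_squared_loss(y, pred):
    """Sum over n of (y_n - g(x_n))^2."""
    return np.sum((y - pred) ** 2)

def sum_absolute_loss(y, pred):
    """Sum over n of |y_n - g(x_n)|."""
    return np.sum(np.abs(y - pred))

def mismatch_loss(labels, pred):
    """Sum over n of max(-y_n * g(x_n), 0), for labels and predictions in {+1, -1}."""
    return np.sum(np.maximum(-labels * pred, 0))

# Compare two candidate lines under the two regression losses.
for a, b in [(1.0, 2.0), (2.5, 1.3)]:
    pred = g(x, a, b)
    print(f"(a, b) = ({a}, {b}): "
          f"squared = {sum_squared_loss(y, pred):.4f}, "
          f"absolute = {sum_absolute_loss(y, pred):.4f}")

# Binary example: exactly one of the four pairs is mismatched.
labels = np.array([1, -1, 1, -1])
binary_pred = np.array([1, 1, 1, -1])
print("mismatched pairs:", mismatch_loss(labels, binary_pred))
```

Different losses can rank the same candidates differently when outliers are present, which is one reason the choice is problem-specific.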

Among these possible choices of the training error, we are going to focus on the 
sum-squared loss because it is convex and differentiable. This makes the computation easy, 
since we can run any textbook optimization algorithm. The regression problem under the 
sum-squared loss is: 

$$(\widehat{a}, \widehat{b}) = \underset{(a,b)}{\text{argmin}} \;\; \sum_{n=1}^{N} \Big( y_n - \underbrace{(a x_n + b)}_{=g(x_n)} \Big)^2. \qquad (7.3)$$

In this equation, the symbol “argmin” means “argument minimize”, which returns the 
argument that minimizes the cost function on the right. The interpretation of the equation is 
that we seek the $(a,b)$ that minimize the sum $\sum_{n=1}^{N} (y_n - (a x_n + b))^2$. Since we are 
minimizing the squared error, this linear regression problem is also known as the least squares 
fitting problem. The idea is summarized in the following box. 


What is linear least squares fitting? 

• Find a line $g(x) = ax + b$ that best fits the training data $\{(x_n, y_n)\}_{n=1}^{N}$. 


• The optimality criterion is to minimize the squared error 

$$\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \sum_{n=1}^{N} \big( y_n - g(x_n) \big)^2, \qquad (7.4)$$

where $\boldsymbol{\theta} = (a,b)$ is the model parameter. 


• There exist other optimality criteria. Squared error is convex and differentiable. 


7.1.2 Solving the linear regression problem 


Let’s consider how to solve the linear regression problem given by Equation (7.3). The 
problem is the following: 

$$(\widehat{a}, \widehat{b}) = \underset{(a,b)}{\text{argmin}} \;\; \mathcal{E}_{\text{train}}(a, b). \qquad (7.5)$$

As with any two-dimensional optimization problem, the optimal point $(\widehat{a}, \widehat{b})$ should 
have a zero gradient, meaning that 

$$\frac{\partial}{\partial a} \mathcal{E}_{\text{train}}(a, b) = 0 \quad \text{and} \quad \frac{\partial}{\partial b} \mathcal{E}_{\text{train}}(a, b) = 0.$$

This should be familiar to you, even if you have only learned basic calculus. This pair of 
equations says that, at a minimum point, the directional slopes should be zero no matter 
which direction you are looking at. 

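The derivative with respect to $a$ is 

$$\begin{aligned}
\frac{\partial}{\partial a} \mathcal{E}_{\text{train}}(a, b)
&= \frac{\partial}{\partial a} \left\{ \sum_{n=1}^{N} \big( y_n - (a x_n + b) \big)^2 \right\} \\
&= \frac{\partial}{\partial a} \left\{ \big( y_1 - (a x_1 + b) \big)^2 + \big( y_2 - (a x_2 + b) \big)^2 + \cdots + \big( y_N - (a x_N + b) \big)^2 \right\} \\
&= 2 \big( y_1 - (a x_1 + b) \big)(-x_1) + \cdots + 2 \big( y_N - (a x_N + b) \big)(-x_N) \\
&= 2 \left( -\sum_{n=1}^{N} x_n y_n + a \sum_{n=1}^{N} x_n^2 + b \sum_{n=1}^{N} x_n \right).
\end{aligned}$$

Similarly, the derivative with respect to $b$ is 

$$\begin{aligned}
\frac{\partial}{\partial b} \mathcal{E}_{\text{train}}(a, b)
&= 2 \big( y_1 - (a x_1 + b) \big)(-1) + \cdots + 2 \big( y_N - (a x_N + b) \big)(-1) \\
&= 2 \left( -\sum_{n=1}^{N} y_n + a \sum_{n=1}^{N} x_n + b \sum_{n=1}^{N} 1 \right).
\end{aligned}$$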


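Setting these two equations to zero, we have that 

$$\begin{aligned}
2 \left( -\sum_{n=1}^{N} x_n y_n + a \sum_{n=1}^{N} x_n^2 + b \sum_{n=1}^{N} x_n \right) &= 0, \\
2 \left( -\sum_{n=1}^{N} y_n + a \sum_{n=1}^{N} x_n + b \sum_{n=1}^{N} 1 \right) &= 0.
\end{aligned}$$

Rearranging the terms, the pair can be equivalently written as 

$$\begin{bmatrix} \sum_{n=1}^{N} x_n^2 & \sum_{n=1}^{N} x_n \\ \sum_{n=1}^{N} x_n & N \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \sum_{n=1}^{N} x_n y_n \\ \sum_{n=1}^{N} y_n \end{bmatrix}.$$

Therefore, if we can solve this system of linear equations, we will have the linear regression 
solution. 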
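The 2-by-2 system above can be checked numerically. The sketch below assembles the system directly from the sums, solves it, and confirms that both partial derivatives vanish at the solution. The synthetic data, the random seed, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, assumed for reproducibility
N = 50
x = rng.random(N)
y = 2.5 * x + 1.3 + 0.2 * rng.standard_normal(N)   # noisy line with a = 2.5, b = 1.3

# Assemble the 2-by-2 system from the sums in the derivation.
A = np.array([[np.sum(x ** 2), np.sum(x)],
              [np.sum(x),      N]])
rhs = np.array([np.sum(x * y), np.sum(y)])
a_hat, b_hat = np.linalg.solve(A, rhs)

# Both partial derivatives of the training loss should be (numerically) zero.
grad_a = 2 * (-np.sum(x * y) + a_hat * np.sum(x ** 2) + b_hat * np.sum(x))
grad_b = 2 * (-np.sum(y) + a_hat * np.sum(x) + b_hat * N)

print("estimated (a, b):", a_hat, b_hat)
print("gradient at the solution:", grad_a, grad_b)
```

The estimates should land near the parameters used to synthesize the data, with the gap controlled by the noise level and the sample size.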


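Remark. It is easy to see that the solution achieves the minimum instead of the maximum, 
since the second-order derivatives are positive: 

$$\frac{\partial^2}{\partial a^2} \mathcal{E}_{\text{train}}(a, b) = 2 \sum_{n=1}^{N} x_n^2 > 0 \quad \text{and} \quad \frac{\partial^2}{\partial b^2} \mathcal{E}_{\text{train}}(a, b) = 2 \sum_{n=1}^{N} 1 > 0.$$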


The following theorem summarizes this intermediate result. 


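Theorem 7.1. The solution of the problem in Equation (7.5), 

$$(\widehat{a}, \widehat{b}) = \underset{(a,b)}{\text{argmin}} \;\; \sum_{n=1}^{N} \big( y_n - (a x_n + b) \big)^2,$$

satisfies the equation 

$$\begin{bmatrix} \sum_{n=1}^{N} x_n^2 & \sum_{n=1}^{N} x_n \\ \sum_{n=1}^{N} x_n & N \end{bmatrix} \begin{bmatrix} \widehat{a} \\ \widehat{b} \end{bmatrix} = \begin{bmatrix} \sum_{n=1}^{N} x_n y_n \\ \sum_{n=1}^{N} y_n \end{bmatrix}. \qquad (7.6)$$


Matrix-vector form of linear regression 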


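Solving this linear regression requires some basic linear algebra. The regression can be 
written as 

$$\underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}}_{\mathbf{y}} = \underbrace{\begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix}}_{\mathbf{X}} \underbrace{\begin{bmatrix} a \\ b \end{bmatrix}}_{\boldsymbol{\theta}} + \underbrace{\begin{bmatrix} e_1 \\ \vdots \\ e_N \end{bmatrix}}_{\mathbf{e}}.$$

With $\mathbf{X}$, $\mathbf{y}$, $\boldsymbol{\theta}$ and $\mathbf{e}$, we can write the linear regression problem compactly as 

$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \mathbf{e}.$$

Therefore, the training loss $\mathcal{E}_{\text{train}}(\boldsymbol{\theta})$ can be defined as 

$$\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 = \left\| \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} \right\|^2.$$

Now, taking the gradient with respect to $\boldsymbol{\theta}$ yields² 

$$\nabla_{\boldsymbol{\theta}} \mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 \right\} = -2 \mathbf{X}^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}).$$

Equating this to zero, we obtain 

$$\mathbf{X}^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = \mathbf{0} \quad \Longleftrightarrow \quad \mathbf{X}^T \mathbf{X} \boldsymbol{\theta} = \mathbf{X}^T \mathbf{y}. \qquad (7.7)$$

Equation (7.7) is called the normal equation. 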
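The normal equation is a convenient way of constructing the system of linear equations. 
Using the 2-by-2 system shown in Equation (7.6) as an example, we note that 

$$\mathbf{X}^T \mathbf{X} = \begin{bmatrix} x_1 & \cdots & x_N \\ 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix} = \begin{bmatrix} \sum_{n=1}^{N} x_n^2 & \sum_{n=1}^{N} x_n \\ \sum_{n=1}^{N} x_n & N \end{bmatrix},$$

$$\mathbf{X}^T \mathbf{y} = \begin{bmatrix} x_1 & \cdots & x_N \\ 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} \sum_{n=1}^{N} x_n y_n \\ \sum_{n=1}^{N} y_n \end{bmatrix}.$$

Therefore, as long as you can construct the $\mathbf{X}$ matrix, forming the 2-by-2 system in 
Equation (7.6) is straightforward: start with $\mathbf{y} = \mathbf{X}\boldsymbol{\theta}$ and then multiply both sides by the 
matrix transpose $\mathbf{X}^T$. The resulting system is what you need. There is nothing to memorize. 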
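As a quick numerical check of this construction, the sketch below forms the matrix products for a straight-line fit, compares them with the sums in the 2-by-2 system, and solves the normal equation. The synthetic data and the variable names are assumptions for illustration; `np.linalg.lstsq` is used only as an independent reference.

```python
import numpy as np

rng = np.random.default_rng(1)          # fixed seed, assumed for reproducibility
N = 50
x = rng.random(N)
y = 2.5 * x + 1.3 + 0.2 * rng.standard_normal(N)

X = np.column_stack((x, np.ones(N)))    # N-by-2 data matrix [x, 1]
XtX = X.T @ X                           # should equal [[sum x^2, sum x], [sum x, N]]
Xty = X.T @ y                           # should equal [sum x*y, sum y]

theta_normal = np.linalg.solve(XtX, Xty)             # solve the normal equation
theta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # reference least squares solver

print("X^T X =\n", XtX)
print("normal equation:", theta_normal)
print("lstsq:          ", theta_lstsq)
```

For a well-conditioned problem like this one the two routes agree to machine precision. Library solvers typically avoid forming the product explicitly because doing so squares the condition number.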


²This is a basic vector calculus result. For details, you may consult standard texts such as the University 
of Waterloo’s matrix cookbook. https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf 

Running linear regression on a computer 


On a computer, solving the linear regression for a line is straightforward. Let us look at the 
MATLAB code first. 


% MATLAB code to fit data points using a straight line 
N = 50; 
x = rand(N,1)*1; 
a = 2.5;                            % true parameter 
b = 1.3;                            % true parameter 
y = a*x + b + 0.2*rand(size(x));    % Synthesize training data 

X = [x(:) ones(N,1)];               % construct the X matrix 
theta = X\y(:);                     % solve y = X theta 

t = linspace(0, 1, 200);            % interpolate and plot 
yhat = theta(1)*t + theta(2); 
plot(x,y,'o','LineWidth',2); hold on; 
plot(t,yhat,'r','LineWidth',4); 


In this piece of MATLAB code, we need to define the data matrix $\mathbf{X}$. Here, x(:) is the 
column vector that stores all the values $(x_1, \ldots, x_N)$. The all-one vector ones(N,1) is the 
second column in our $\mathbf{X}$ matrix. The command X\y(:) is equivalent to solving the normal 
equation 

$$\mathbf{X}^T \mathbf{X} \boldsymbol{\theta} = \mathbf{X}^T \mathbf{y}.$$

The last few lines are used to plot the predicted curve. Note that theta(1) and theta(2) 
are the entries of the solution $\widehat{\boldsymbol{\theta}}$. The result of this program is exactly the plot shown in 
Figure 7.5 above. 

In Python, the program is quite similar. The command we use to solve the least squares 
problem is np.linalg.lstsq. 


# Python code to fit data points using a straight line 
import numpy as np 
import matplotlib.pyplot as plt 

N = 50 
x = np.random.rand(N) 
a = 2.5                                        # true parameter 
b = 1.3                                        # true parameter 
y = a*x + b + 0.2*np.random.randn(N)           # Synthesize training data 

X = np.column_stack((x, np.ones(N)))           # construct the X matrix 
theta = np.linalg.lstsq(X, y, rcond=None)[0]   # solve y = X theta 

t = np.linspace(0,1,200)                       # interpolate and plot 
yhat = theta[0]*t + theta[1] 
plt.plot(x,y,'o') 
plt.plot(t,yhat,'r',linewidth=4) 
plt.show() 
7.1.3 Extension: Beyond a straight line 


Regression is a powerful technique. Although we have discussed its usefulness for fitting 
straight lines, the same concept can fit other curves. 

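To generalize the regression formulation, we consider a $d$-dimensional regression 
coefficient vector $\boldsymbol{\theta} = [\theta_0, \ldots, \theta_{d-1}]^T \in \mathbb{R}^d$ and a general linear model 

$$g_{\boldsymbol{\theta}}(x_n) = \sum_{p=0}^{d-1} \theta_p \phi_p(x_n).$$

Here, the mappings $\{\phi_p(\cdot)\}_{p=0}^{d-1}$ can be considered as a nonlinear transformation that takes 
the input $x_n$ and maps it to another value. For example, $\phi_p(\cdot) = (\cdot)^p$ will map an input $x$ 
to its $p$th power $x^p$. 

We can now write the system of linear equations as 

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{\mathbf{y}} = \underbrace{\begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{d-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{d-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{d-1}(x_N) \end{bmatrix}}_{\mathbf{X}} \underbrace{\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{d-1} \end{bmatrix}}_{\boldsymbol{\theta}} + \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}}_{\mathbf{e}}. \qquad (7.8)$$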


Let us look at some examples. 


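Example 7.1. (Quadratic fitting) Consider the linear regression problem using a 
quadratic equation: 

$$y_n = a x_n^2 + b x_n + c, \qquad n = 1, \ldots, N.$$

Express this equation in matrix-vector form. 


Solution. The matrix-vector expression is 

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} x_1^2 & x_1 & 1 \\ x_2^2 & x_2 & 1 \\ \vdots & \vdots & \vdots \\ x_N^2 & x_N & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix}.$$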

The MATLAB and Python programs for Example 7.1 are shown below. A numerical 
example is illustrated in Figure 7.6. 


[Figure 7.6: scatter plot of the data points with the fitted quadratic curve; the horizontal axis runs from 0 to 1.] 

Figure 7.6: Example: Our goal is to fit the dataset of 50 data points shown above. The model we use 
is $g_{\boldsymbol{\theta}}(x_n) = a x_n^2 + b x_n + c$, for $n = 1, \ldots, N$. 


% MATLAB code to fit data using a quadratic equation 
N = 50; 
x = rand(N,1)*1; 
a = -2.5; 
b = 1.3; 
c = 1.2; 
y = a*x.^2 + b*x + c + 1*rand(size(x)); 

N = length(x); 
X = [ones(N,1) x(:) x(:).^2];       % columns are 1, x, x^2 
theta = X\y(:);                     % theta = [c; b; a] in this column order 
t = linspace(0, 1, 200); 
yhat = theta(3)*t.^2 + theta(2)*t + theta(1); 

plot(x,y,'o','LineWidth',2); hold on; 
plot(t,yhat,'r','LineWidth',6); 


# Python code to fit data using a quadratic equation 
import numpy as np 
import matplotlib.pyplot as plt 

N = 50 
x = np.random.rand(N) 
a = -2.5 
b = 1.3 
c = 1.2 
y = a*x**2 + b*x + c + 0.2*np.random.randn(N) 

X = np.column_stack((np.ones(N), x, x**2))     # columns are 1, x, x^2 
theta = np.linalg.lstsq(X, y, rcond=None)[0]   # theta = [c, b, a] in this column order 
t = np.linspace(0,1,200) 
yhat = theta[0] + theta[1]*t + theta[2]*t**2 
plt.plot(x,y,'o') 
plt.plot(t,yhat,'r',linewidth=4) 
plt.show() 


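The generalization to polynomials of arbitrary order is to replace the model with 

$$g_{\boldsymbol{\theta}}(x_n) = \sum_{p=0}^{d-1} \theta_p x_n^p,$$

where $p = 0, 1, \ldots, d-1$ represent the orders of the polynomials and $\theta_0, \ldots, \theta_{d-1}$ are the 
regression coefficients. In this case, the matrix system is 

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 & x_1 & \cdots & x_1^{d-1} \\ 1 & x_2 & \cdots & x_2^{d-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & \cdots & x_N^{d-1} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{d-1} \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix},$$

which again is in the form of $\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \mathbf{e}$. 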
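For a polynomial of arbitrary order, the data matrix can be generated in one call. The following sketch assumes NumPy's `np.vander` with `increasing=True`, which places the powers 0 through d-1 in the columns from left to right. The helper name `poly_design_matrix`, the synthetic data, and the noise level are illustrative assumptions.

```python
import numpy as np

def poly_design_matrix(x, d):
    """Return the N-by-d matrix whose columns are x^0, x^1, ..., x^(d-1)."""
    return np.vander(x, N=d, increasing=True)

rng = np.random.default_rng(2)                   # fixed seed, assumed for reproducibility
x = rng.random(50)
theta_true = np.array([1.2, 1.3, -2.5])          # theta_0, theta_1, theta_2 (illustrative)

X = poly_design_matrix(x, 3)
y = X @ theta_true + 0.01 * rng.standard_normal(50)   # small noise for a clean check

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print("true coefficients:     ", theta_true)
print("estimated coefficients:", theta_hat)
```

Changing the second argument of `poly_design_matrix` changes the polynomial order without touching the rest of the pipeline.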


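Example 7.2. (Legendre polynomial fitting) Let $\{L_p(\cdot)\}_{p=0}^{d-1}$ be a set of Legendre 
polynomials (see discussions below), and consider the linear regression problem using 

$$y_n = \sum_{p=0}^{d-1} \theta_p L_p(x_n), \qquad n = 1, \ldots, N.$$

Express this equation in matrix-vector form. 


Solution. The matrix-vector expression is 

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} L_0(x_1) & L_1(x_1) & \cdots & L_{d-1}(x_1) \\ L_0(x_2) & L_1(x_2) & \cdots & L_{d-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ L_0(x_N) & L_1(x_N) & \cdots & L_{d-1}(x_N) \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{d-1} \end{bmatrix}.$$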


Legendre polynomials are orthogonal polynomials. In conventional polynomials, the 
functions $\{x, x^2, x^3, \ldots, x^p\}$ are not orthogonal. As we increase $p$, the set of functions 
$\{x, x^2, x^3, \ldots, x^p\}$ will have redundancy, which will eventually result in the matrix 
$\mathbf{X}^T\mathbf{X}$ being ill-conditioned and, numerically, noninvertible. 
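One way to see this redundancy numerically is to compare the condition numbers of the Gram matrices built from the two bases. The sketch below is a minimal illustration under assumed settings (50 equally spaced inputs on [-1, 1] and d = 10 basis functions); it uses `legvander` from `numpy.polynomial.legendre` to evaluate the Legendre basis.

```python
import numpy as np
from numpy.polynomial import legendre

x = np.linspace(-1, 1, 50)     # inputs on [-1, 1] (assumed for this sketch)
d = 10                         # number of basis functions (assumed)

X_mono = np.vander(x, N=d, increasing=True)   # columns 1, x, ..., x^(d-1)
X_leg = legendre.legvander(x, d - 1)          # columns L_0(x), ..., L_{d-1}(x)

# A large condition number means the columns are close to linearly dependent.
cond_mono = np.linalg.cond(X_mono.T @ X_mono)
cond_leg = np.linalg.cond(X_leg.T @ X_leg)

print(f"condition number, ordinary polynomials: {cond_mono:.2e}")
print(f"condition number, Legendre polynomials: {cond_leg:.2e}")
```

Both bases span the same set of polynomials, so the fitted curves coincide in exact arithmetic. The difference lies in how sensitive the computed coefficients are to noise and rounding.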

The $p$th-order Legendre polynomial is denoted by $L_p(x)$. Using the Legendre 
polynomials as the building block of the regression problem, the model is expressed as 

$$\begin{aligned}
g_{\boldsymbol{\theta}}(x) &\overset{\text{def}}{=} \sum_{p=0}^{d-1} \theta_p L_p(x) \\
&= \theta_0 L_0(x) + \theta_1 \underbrace{L_1(x)}_{=x} + \theta_2 \underbrace{L_2(x)}_{=\frac{1}{2}(3x^2 - 1)} + \cdots + \theta_{d-1} L_{d-1}(x),
\end{aligned}$$

where $L_0(\cdot)$, $L_1(\cdot)$ and $L_2(\cdot)$ are the Legendre polynomials of order 0, 1 and 2, respectively. 
As an example, the first few leading Legendre polynomials are 

$$\begin{aligned}
L_0(x) &= 1, \\
L_1(x) &= x, \\
L_2(x) &= \tfrac{1}{2}(3x^2 - 1), \\
L_3(x) &= \tfrac{1}{2}(5x^3 - 3x).
\end{aligned}$$

The order of the Legendre polynomials is always the same as that of the ordinary 
polynomials. The shapes of these polynomials are shown in Figure 7.7(a). 
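The closed forms above, and the orthogonality that distinguishes the Legendre polynomials from ordinary powers, can be verified with a few lines. This sketch assumes `legval` and `leggauss` from `numpy.polynomial.legendre`; the inner product is taken to be the integral of the product over [-1, 1], computed with Gauss-Legendre quadrature, which is exact for the low-degree polynomials used here.

```python
import numpy as np
from numpy.polynomial import legendre

x = np.linspace(-1, 1, 5)

# A coefficient vector with a single 1 in position p selects L_p.
L2 = legendre.legval(x, [0, 0, 1])
L3 = legendre.legval(x, [0, 0, 0, 1])
print("L2 matches (3x^2 - 1)/2: ", np.allclose(L2, 0.5 * (3 * x**2 - 1)))
print("L3 matches (5x^3 - 3x)/2:", np.allclose(L3, 0.5 * (5 * x**3 - 3 * x)))

# Inner products over [-1, 1] via 8-point Gauss-Legendre quadrature.
nodes, weights = legendre.leggauss(8)
inner_legendre = np.sum(weights
                        * legendre.legval(nodes, [0, 0, 1])
                        * legendre.legval(nodes, [0, 0, 0, 1]))
inner_monomial = np.sum(weights * nodes * nodes**3)   # integral of x * x^3

print("<L2, L3>  =", inner_legendre)    # zero up to rounding: orthogonal
print("<x, x^3>  =", inner_monomial)    # 2/5: not orthogonal
```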


[Figure 7.7: two panels plotted over the range from -1 to 1. Panel (b) shows the data together with the fits obtained using the Legendre basis and the polynomial basis.] 


Figure 7.7: (a) The first 5 leading Legendre polynomials plotted in the range of $-1 < x < 1$. (b) Fitting 
the data using an ordinary polynomial and a Legendre polynomial. 


Figure 7.7(b) demonstrates a fitting problem using the Legendre polynomials. You 
can see that the fitting is just as good as that of the ordinary polynomials (which should 
be the case). However, if we compare the coefficients, we observe that the magnitude of 
the Legendre coefficients is smaller (see Table 7.1). In general, as the order of polynomials 
increases and the noise grows, it becomes increasingly difficult to fit the data using the 
ordinary polynomials. 


                        θ_4       θ_3       θ_2       θ_1       θ_0 
Ordinary polynomials    5.3061    3.3519   -3.6285   -1.8729    0.1540 
Legendre polynomials    1.2128    1.3408    0.6131    0.1382    0.0057 


Table 7.1: The regression coefficients of an ordinary polynomial and a Legendre polynomial. Note that 
while both polynomials can fit the data, the Legendre polynomial coefficients have smaller magnitudes. 


Calling Legendre polynomials for regression is not difficult in MATLAB and Python. 
Specifically, one can call legendreP in MATLAB and scipy.special.eval_legendre in 
Python. 


% MATLAB code to fit data using Legendre polynomials 
N = 50; 
x = 1*(rand(N,1)*2-1); 
a = [-0.001 0.01 +0.55 1.5 1.2]; 
y = a(1)*legendreP(0,x) + a(2)*legendreP(1,x) + ... 
    a(3)*legendreP(2,x) + a(4)*legendreP(3,x) + ... 
    a(5)*legendreP(4,x) + 0.5*randn(N,1); 

% The "..." continuations keep all five columns in a single row of blocks 
X = [legendreP(0,x(:)) legendreP(1,x(:)) ... 
     legendreP(2,x(:)) legendreP(3,x(:)) ... 
     legendreP(4,x(:))]; 
beta = X\y(:); 

t = linspace(-1, 1, 200); 
yhat = beta(1)*legendreP(0,t) + beta(2)*legendreP(1,t) + ... 
       beta(3)*legendreP(2,t) + beta(4)*legendreP(3,t) + ... 
       beta(5)*legendreP(4,t); 
plot(x,y,'ko','LineWidth',2,'MarkerSize',10); hold on; 
plot(t,yhat,'LineWidth',6,'Color',[0.9 0 0]); 


# Python code to fit data using Legendre polynomials 
import numpy as np 
import matplotlib.pyplot as plt 
from scipy.special import eval_legendre 

N = 50 
x = np.linspace(-1,1,N) 
a = np.array([-0.001, 0.01, 0.55, 1.5, 1.2]) 
y = a[0]*eval_legendre(0,x) + a[1]*eval_legendre(1,x) + \ 
    a[2]*eval_legendre(2,x) + a[3]*eval_legendre(3,x) + \ 
    a[4]*eval_legendre(4,x) + 0.2*np.random.randn(N) 

X = np.column_stack((eval_legendre(0,x), eval_legendre(1,x), \ 
                     eval_legendre(2,x), eval_legendre(3,x), \ 
                     eval_legendre(4,x))) 
theta = np.linalg.lstsq(X, y, rcond=None)[0] 

t = np.linspace(-1, 1, 50) 
yhat = theta[0]*eval_legendre(0,t) + theta[1]*eval_legendre(1,t) + \ 
       theta[2]*eval_legendre(2,t) + theta[3]*eval_legendre(3,t) + \ 
       theta[4]*eval_legendre(4,t) 

plt.plot(x,y,'o',markersize=12) 
plt.plot(t,yhat, linewidth=8) 
plt.show() 


The idea of fitting a set of data using the Legendre polynomials belongs to the larger 
family of basis functions. In general, we can use a set of basis functions to model the data: 

$$g_{\boldsymbol{\theta}}(x) = \sum_{p=0}^{d-1} \theta_p \phi_p(x),$$

where $\{\phi_p(x)\}_{p=0}^{d-1}$ are the basis functions and $\{\theta_p\}_{p=0}^{d-1}$ are the regression coefficients. The 
constant $\theta_0$ is often called the bias of the regression. 

The choice of the $\phi_p(x)$ can be extremely broad. One can choose the ordinary polynomials 
$\phi_p(x) = x^p$ or the Legendre polynomials $\phi_p(x) = L_p(x)$. Other choices are also available: 

• Fourier basis: $\phi_p(x) = e^{j\omega_p x}$, where $\omega_p$ is the $p$th carrier frequency. 

• Sinusoid basis: $\phi_p(x) = \sin(\omega_p x)$, which is the same as the Fourier basis but taking the 
imaginary part. 

• Gaussian basis: $\phi_p(x) = \frac{1}{\sqrt{2\pi\sigma_p^2}} \exp\left\{ -\frac{(x - \mu_p)^2}{2\sigma_p^2} \right\}$, where $(\mu_p, \sigma_p)$ are the model 
parameters. 


Evidently, by choosing different basis functions we have different ways to fit the data. There 
is no definitive answer as to which functions are better. Statistical techniques such as model 
selection are available, but experience will tell you to align with one and not the other. It 
is frequently more useful to have some domain knowledge rather than resorting to various 
computational techniques. 
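Because every choice of basis leads to the same matrix form, one small helper can build the data matrix from any list of basis functions. The sketch below uses a constant (bias) column plus five Gaussian basis functions. The helper names `gaussian_basis` and `design_matrix`, the fixed centers and width, and the synthetic sinusoidal data are all assumptions made for illustration; here the Gaussian parameters are fixed in advance rather than learned.

```python
import numpy as np

def gaussian_basis(x, mu, sigma):
    """Gaussian basis function with center mu and width sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def design_matrix(x, basis_functions):
    """Stack phi_0(x), phi_1(x), ... as the columns of X."""
    return np.column_stack([phi(x) for phi in basis_functions])

# Basis: a constant (bias) term followed by Gaussians at assumed, fixed centers.
centers = np.linspace(0, 1, 5)
sigma = 0.2
basis = [lambda x: np.ones_like(x)]
basis += [lambda x, m=mu: gaussian_basis(x, m, sigma) for mu in centers]

rng = np.random.default_rng(3)                  # fixed seed, assumed for reproducibility
x = rng.random(50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

X = design_matrix(x, basis)
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residual = np.linalg.norm(y - X @ theta_hat)
print("design matrix shape:", X.shape)
print("residual norm:", residual)
```

Swapping in a Fourier or polynomial basis only changes the list `basis`; the solve step stays the same.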


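How to fit data using basis functions 

• Construct this equation: 

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{\mathbf{y}} = \underbrace{\begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{d-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{d-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{d-1}(x_N) \end{bmatrix}}_{\mathbf{X}} \underbrace{\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{d-1} \end{bmatrix}}_{\boldsymbol{\theta}}.$$

• The functions $\phi_p(x)$ are the basis functions, e.g., $\phi_p(x) = x^p$ for ordinary 
polynomials. 

• You can replace the polynomials with the Legendre polynomials. 

• You can also replace the polynomials with other basis functions. 

• Solve for $\boldsymbol{\theta}$ by 

$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \;\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2.$$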


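Example 7.3. (Autoregressive model) Consider a two-tap autoregressive model: 

$$y_n = a y_{n-1} + b y_{n-2},$$

where we assume $y_0 = y_{-1} = 0$. Express this equation in the matrix-vector form. 


Solution. The matrix-vector form of the equation is 

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} y_0 & y_{-1} \\ y_1 & y_0 \\ y_2 & y_1 \\ \vdots & \vdots \\ y_{N-1} & y_{N-2} \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}.$$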

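In general, we can append more previous samples to predict the future. The general 
expression is 

$$y_n = \sum_{\ell=1}^{L} \theta_\ell \, y_{n-\ell} + e_n, \qquad n = 1, 2, \ldots, N,$$

where $\ell = 1, 2, \ldots, L$ denote the previous $L$ samples of the data and $\{\theta_1, \ldots, \theta_L\}$ are the 
regression coefficients. If we do this we see that the matrix expression is 

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ \vdots \\ y_N \end{bmatrix}}_{=\mathbf{y}} = \underbrace{\begin{bmatrix} y_0 & y_{-1} & y_{-2} & \cdots & y_{1-L} \\ y_1 & y_0 & y_{-1} & \cdots & y_{2-L} \\ y_2 & y_1 & y_0 & \cdots & y_{3-L} \\ y_3 & y_2 & y_1 & \cdots & y_{4-L} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ y_{N-1} & y_{N-2} & y_{N-3} & \cdots & y_{N-L} \end{bmatrix}}_{=\mathbf{X}} \underbrace{\begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_L \end{bmatrix}}_{=\boldsymbol{\theta}} + \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ \vdots \\ e_N \end{bmatrix}}_{=\mathbf{e}}.$$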

Observe the pattern associated with this matrix X. Each column is a one-entry shifted 
version of the previous column. This matrix is called a Toeplitz matrix. 
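The shift pattern is easiest to see on a tiny example. The following sketch builds the lagged data matrix with plain NumPy, taking samples before the start of the record to be zero. The helper name `lag_matrix` and the five-sample signal are illustrative assumptions.

```python
import numpy as np

def lag_matrix(y, L):
    """Build the N-by-L matrix whose row n holds y[n-1], y[n-2], ..., y[n-L].

    Indices are 0-based, and samples before the start of the record are zero.
    """
    N = len(y)
    padded = np.concatenate((np.zeros(L), y))    # padded[L + k] == y[k]
    X = np.empty((N, L))
    for ell in range(1, L + 1):
        # Column ell-1 is the signal delayed by ell samples.
        X[:, ell - 1] = padded[L - ell : L - ell + N]
    return X

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = lag_matrix(y, 3)
print(X)
# Each column is the previous column shifted down by one entry,
# so the entries are constant along every diagonal.
```

For this input the rows are [0, 0, 0], [1, 0, 0], [2, 1, 0], [3, 2, 1] and [4, 3, 2], which is the constant-diagonal structure described above.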

The MATLAB (and Python) code for calling and using the Toeplitz matrix is shown 
below. 


% MATLAB code for auto-regressive model 
N = 500; 
y = cumsum(0.2*randn(N,1)) + 0.05*randn(N,1);   % generate data 

L = 100;                    % use previous 100 samples 
c = [0; y(1:400-1)];        % first column of the Toeplitz matrix 
r = zeros(1,L);             % first row of the Toeplitz matrix 
X = toeplitz(c,r);          % Toeplitz matrix 
theta = X\y(1:400);         % solve y = X theta 
yhat = X*theta;             % prediction 
plot(y(1:400),'ko','LineWidth',2); hold on; 
plot(yhat(1:400),'r','LineWidth',4); 


# Python code for auto-regressive model 
import numpy as np 
import matplotlib.pyplot as plt 
from scipy.linalg import toeplitz 

N = 500 
y = np.cumsum(0.2*np.random.randn(N)) + 0.05*np.random.randn(N) 

L = 100                                  # use previous 100 samples 
c = np.hstack((0, y[0:400-1]))           # first column of the Toeplitz matrix 
r = np.zeros(L)                          # first row of the Toeplitz matrix 
X = toeplitz(c,r) 
theta = np.linalg.lstsq(X, y[0:400], rcond=None)[0] 
yhat = np.dot(X, theta) 
plt.plot(y[0:400], 'o') 
plt.plot(yhat[0:400], linewidth=4) 
plt.show() 


The plots generated by the above programs are shown in Figure 7.8(a). Note that we 
are doing an interpolation, because we are predicting the values within the training dataset. 


[Figure 7.8: two panels, (a) and (b), each plotted against the sample index from 0 to 500.] 


Figure 7.8: Autoregressive model on a simulated dataset, using L = 100 coefficients. (a) Training data. 
Note that the model trains very well on this dataset. (b) Testing data. When tested on future data, the 
autoregressive model can still predict for a few samples but loses track when the time elapsed grows. 


We now consider extrapolation. Given the training data, we can find the regression 
coefficients by solving the above linear equation. This gives us $\widehat{\boldsymbol{\theta}}$. To predict the future 
samples we need to return to the equation 

$$\widehat{y}_n = \sum_{\ell=1}^{L} \widehat{\theta}_\ell \underbrace{\widehat{y}_{n-\ell}}_{\text{=previous estimate}}, \qquad n = 1, 2, \ldots, N,$$

where $\widehat{y}_{n-\ell}$ are the previous estimates. For example, if we are given 100 days of stock prices, 
then predicting the 101st day’s price should be based on the $L$ days before the 101st. A 
simple for-loop suffices for such a calculation. 
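The for-loop can be sketched as follows. Each new estimate is appended to the history so that later predictions are built on earlier ones. The function name `ar_forecast` and the two-tap coefficients are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

def ar_forecast(history, theta, steps):
    """Roll an autoregressive model forward for `steps` samples.

    history : known samples, oldest first (at least len(theta) of them)
    theta   : coefficients theta_1, ..., theta_L
    """
    L = len(theta)
    buf = list(history[-L:])           # the L most recent samples, oldest first
    out = []
    for _ in range(steps):
        recent = buf[::-1][:L]         # y[n-1], y[n-2], ..., y[n-L]
        y_next = float(np.dot(theta, recent))
        out.append(y_next)
        buf.append(y_next)             # the estimate becomes a "previous sample"
    return np.array(out)

# Hypothetical two-tap model: y_n = 0.5*y_{n-1} + 0.25*y_{n-2}
theta = np.array([0.5, 0.25])
history = np.array([1.0, 2.0])
forecast = ar_forecast(history, theta, 3)
print(forecast)   # [1.25, 1.125, 0.875]
```

Because each estimate feeds the next one, any error made early is carried forward, which is consistent with forecasts that drift as the horizon grows.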

Figure 7.8(b) shows a numerical example of extrapolating data using the autoregressive 
model. In this experiment we use N = 400 samples to train an autoregressive model of order 
L = 100. We then predict the data for another 100 data points. As you can see from the 
figure, the first few samples still look reasonable. However, as time increases, the model 
starts to lose track of the real trend. 

Is there any way we can improve the autoregressive model? A simple way is to increase 
the memory L so that we can use a long history to predict the future. This boils down to 
the long-term running average of the curve, which works well in many cases. However, if 
the testing data does not follow the same distribution as the training data (which is often 
the case in the real stock market because unexpected news can change the stock price), 
then even the long-term average will not be a good forecast. That is why data scientists on 
Wall Street make so much money: they have advanced mathematical tools for modeling the 
stock market. Nevertheless, we hope that the autoregressive model provides you with a new 
perspective for analyzing data. 

The summary below highlights the main ideas of the autoregressive model. 
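What is the autoregressive model? 


• It solves this problem 

$$y_n = \sum_{\ell=1}^{L} \theta_\ell \, y_{n-\ell} + e_n, \qquad n = 1, 2, \ldots, N.$$

• The number of taps in the past history would affect the memory and hence the 
long-term forecast. 


• Solve for $\boldsymbol{\theta}$ by 

$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\text{argmin}} \;\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2.$$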


7.1.4 Overdetermined and underdetermined systems 


This subsection requires knowledge of some concepts in linear algebra that can be found 
in standard references.³ 


³Carl Meyer, Matrix Analysis and Applied Linear Algebra, SIAM, 2000. 


Let us now consider the theoretical properties of the least squares linear regression 
problem, which is an optimization: 

$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\text{argmin}} \;\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2. \qquad \text{(P1)}$$

We observe that the objective value of this optimization problem can go to zero if and only 
if the minimizer $\widehat{\boldsymbol{\theta}}$ is the solution of the system of linear equations 

$$\text{Find } \boldsymbol{\theta} \text{ such that } \mathbf{y} = \mathbf{X}\boldsymbol{\theta}. \qquad \text{(P2)}$$

We emphasize that Problem (P1) and Problem (P2) are two different problems. Even if we 
cannot solve Problem (P2), Problem (P1) is still well defined, but the objective value will 
not go to zero. This subsection aims to draw the connection between the two problems and 
discuss the respective solutions. We will start with Problem (P2) by considering two shapes 
of the matrix $\mathbf{X}$. 


Overdetermined system 


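Problem (P2) is called overdetermined if $\mathbf{X} \in \mathbb{R}^{N \times d}$ is tall and skinny, i.e., $N > d$. This 
happens when you have more rows than columns, or equivalently when you have more equations 
than unknowns. When $N > d$, Problem (P1) has a unique global minimizer $\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ if 
and only if $\mathbf{X}^T\mathbf{X}$ is invertible, or equivalently if and only if the columns of $\mathbf{X}$ are linearly 
independent. A technical description of this is that $\mathbf{X}$ has a full rank, denoted by $\text{rank}(\mathbf{X}) = d$. 
When $\text{rank}(\mathbf{X}) = d$ and $\mathbf{y}$ lives in the range space of $\mathbf{X}$ (that is, $\mathbf{y} = \mathbf{X}\boldsymbol{\alpha}$ for some $\boldsymbol{\alpha}$), 
this minimizer is also the unique solution of Problem (P2). Otherwise, Problem (P2) has no 
exact solution and $\widehat{\boldsymbol{\theta}}$ is only the least squares fit. 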


[Figure 7.9: diagram of the solution hierarchy of Problems (P1) and (P2) for an overdetermined system, organized by the rank of X and by whether y lies in the range space of X.] 

Figure 7.9: Hierarchy of the solutions of an overdetermined system. An overdetermined system uses 
a tall and skinny matrix $\mathbf{X}$. The rank of a matrix $\mathbf{X}$ is defined as the largest number of independent 
columns we can find in $\mathbf{X}$. If $\text{rank}(\mathbf{X}) = d$, the matrix $\mathbf{X}^T\mathbf{X}$ is invertible, and Problem (P1) will have 
a unique global minimizer, which is also the unique solution of Problem (P2) whenever $\mathbf{y}$ lives in the 
range space of $\mathbf{X}$. If $\text{rank}(\mathbf{X}) < d$, then the solution depends on whether the particular observation $\mathbf{y}$ 
lives in the range space of $\mathbf{X}$. If yes, Problem (P2) will have infinitely many solutions because there is 
a nontrivial null space. If no, Problem (P2) will have no solution because the system is incompatible. 


If the columns of $\mathbf{X}$ are linearly dependent so that $\mathbf{X}^T\mathbf{X}$ is not invertible, we say 
that $\mathbf{X}$ is rank-deficient (denoted as $\text{rank}(\mathbf{X}) < d$). In this case, Problem (P2) may not 
have a solution. We say that it may not have a solution because it is still possible to have a 
solution. It all depends on whether $\mathbf{y}$ can be written as a linear combination of the linearly 
independent columns of $\mathbf{X}$. 

If yes, we say that $\mathbf{y}$ lives in the range space of $\mathbf{X}$. The range space of $\mathbf{X}$ is defined 
as the set of vectors $\{\mathbf{z} \mid \mathbf{z} = \mathbf{X}\boldsymbol{\alpha}, \text{ for some } \boldsymbol{\alpha}\}$. Because $\mathbf{X}$ is tall, the range space is a 
subspace of $\mathbf{R}^N$ of dimension $\text{rank}(\mathbf{X}) \le d < N$, so only some of the $\mathbf{y}$ will live in the range 
space of $\mathbf{X}$. When $\text{rank}(\mathbf{X}) < d$, the matrix $\mathbf{X}$ must also have a nontrivial null space. The null 
space of $\mathbf{X}$ is defined as the set of vectors $\{\mathbf{z} \mid \mathbf{X}\mathbf{z} = \mathbf{0}\}$. A nontrivial null space will give us infinitely 
many solutions to Problem (P2). This is because if $\boldsymbol{\alpha}$ is the solution found in the range 
space so that $\mathbf{y} = \mathbf{X}\boldsymbol{\alpha}$, then we can pick any $\mathbf{z}$ from the null space such that $\mathbf{X}\mathbf{z} = \mathbf{0}$. 
This will lead to another solution $\boldsymbol{\alpha} + \mathbf{z}$ such that $\mathbf{X}(\boldsymbol{\alpha} + \mathbf{z}) = \mathbf{X}\boldsymbol{\alpha} + \mathbf{0} = \mathbf{y}$. Since we have 
infinitely many choices of such $\mathbf{z}$’s, there will be infinitely many solutions to Problem (P2). 

Although there are infinitely many solutions to Problem (P2), all of them are the global 
minimizers of Problem (P1). They can make the objective value equal to zero because the 
equality $\mathbf{y} = \mathbf{X}\boldsymbol{\theta}$ holds. However, the solutions to Problem (P2) are not unique since the 
objective function is convex but not strictly convex. 

If $\mathbf{y}$ does not live in the range space of $\mathbf{X}$, we say that Problem (P2) is incompatible. 
If a system of linear equations is incompatible, there is no solution. However, even when 
this happens, we can still solve the optimization Problem (P1), but the objective value will 
not reach 0. The minimizer is a global minimizer because the objective function is convex, 
but the minimizer is not unique. 
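The cases for a tall matrix can be explored on small hand-made examples. The sketch below uses two 3-by-2 matrices chosen for illustration (they are assumptions, not data from the book): one with independent columns, and one whose second column is twice the first. It checks the rank, the least squares residual, and the effect of adding a null-space vector to a solution.

```python
import numpy as np

# Case 1: tall X with linearly independent columns (rank 2).
X_full = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])
theta_hat = np.linalg.solve(X_full.T @ X_full, X_full.T @ y)   # unique minimizer of (P1)
residual_full = np.linalg.norm(y - X_full @ theta_hat)
print("rank:", np.linalg.matrix_rank(X_full))
print("minimizer:", theta_hat, " residual:", residual_full)
# The residual is nonzero here because this particular y is not in the range space.

# Case 2: rank-deficient X (second column is twice the first).
X_def = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])
rank_def = np.linalg.matrix_rank(X_def)

theta_a = np.array([1.0, 0.0])
y_in = X_def @ theta_a             # a y constructed to lie in the range space
z = np.array([2.0, -1.0])          # null-space vector: X_def @ z = 0
theta_b = theta_a + 3.0 * z        # a different exact solution of (P2)

print("rank:", rank_def)
print("X theta_a:", X_def @ theta_a, " X theta_b:", X_def @ theta_b)
```

In the second case any multiple of `z` can be added without changing the product, which is the source of the infinitely many solutions.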


Underdetermined system 


Problem (P2) is called underdetermined if $\mathbf{X}$ is fat and short, i.e., $N < d$. This happens 
when you have more columns than rows, or equivalently when you have more unknowns than 
equations. In this case, $\mathbf{X}^T\mathbf{X}$ is not invertible, and so we cannot use $\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ 
as the solution. However, if $\text{rank}(\mathbf{X}) = N$, then any $\mathbf{y}$ will live in the range space of $\mathbf{X}$. But 
because $\mathbf{X}$ is fat and short, there exists a nontrivial null space. Therefore, Problem (P2) 
will have infinitely many solutions, attributed to the vectors generated by the null space. 
For this set of infinitely many solutions, the corresponding Problem (P1) will have a global 
minimizer, and the objective value will be zero. However, the minimizer is not unique. This 
is the first case in Figure 7.10. 


[Figure 7.10: diagram of the solution hierarchy of Problems (P1), (P2) and (P3) for an underdetermined system, organized by the rank of X and by whether y lies in the range space of X.] 


Figure 7.10: Hierarchy of the solutions of an underdetermined system. An underdetermined system uses 
a fat and short matrix $\mathbf{X}$. The rank of a matrix $\mathbf{X}$ is defined as the largest number of independent 
columns we can find in $\mathbf{X}$. If $\text{rank}(\mathbf{X}) = N$, we will have infinitely many solutions. If $\text{rank}(\mathbf{X}) < N$, 
then the solution depends on whether the particular observation $\mathbf{y}$ lives in the range space of $\mathbf{X}$. If yes, 
Problem (P2) will have infinitely many solutions because there is a nontrivial null space. If no, Problem 
(P2) will have no solution because the system is incompatible. 


There are two other cases in Figure 7.10, which occur when $\text{rank}(\mathbf{X}) < N$: 


• (i) If $\mathbf{y}$ is in the range space of $\mathbf{X}$, Problem (P2) will have infinitely many solutions. 
Since Problem (P2) remains feasible, the objective function of Problem (P1) will go 
to zero. 


• (ii) If $\mathbf{y}$ is not in the range space of $\mathbf{X}$, the system in Problem (P2) is incompatible 
and there will be no solution. The objective value of Problem (P1) will not go to zero. 


If an underdetermined system has infinitely many solutions, we need to pick and choose. 
One of the possible approaches is to consider the optimization 

$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\text{argmin}} \;\; \|\boldsymbol{\theta}\|^2 \quad \text{subject to} \quad \mathbf{X}\boldsymbol{\theta} = \mathbf{y}. \qquad \text{(P3)}$$

This optimization is different from Problem (P1). Problem (P1) is an unconstrained optimization 
whose goal is to minimize the deviation between $\mathbf{X}\boldsymbol{\theta}$ and $\mathbf{y}$, whereas Problem (P3) is constrained. Since 
we assume that Problem (P2) has infinitely many solutions, the constraint set $\mathbf{y} = \mathbf{X}\boldsymbol{\theta}$ 
is feasible. Among all the feasible choices, we pick the one that minimizes the squared 
norm. Therefore, the solution to Problem (P3) is called the minimum-norm least squares solution. 
Theorem 7.2 below summarizes the solution. If $\mathbf{y}$ does not live in the range space of $\mathbf{X}$, 
then Problem (P2) does not have a solution. Therefore, the constraint in Problem (P3) is infeasible, 
and hence the optimization problem does not have a minimizer. 


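Theorem 7.2. Consider the underdetermined linear regression problem where N < d:

$$
\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\operatorname{argmin}} \;\; \|\boldsymbol{\theta}\|^2 \quad \text{subject to} \quad \mathbf{y} = \mathbf{X}\boldsymbol{\theta},
$$

where $\mathbf{X} \in \mathbb{R}^{N \times d}$, $\boldsymbol{\theta} \in \mathbb{R}^d$, and $\mathbf{y} \in \mathbb{R}^N$. If rank(X) = N, then the linear regression problem will have a unique global minimum

$$
\widehat{\boldsymbol{\theta}} = \mathbf{X}^T (\mathbf{X}\mathbf{X}^T)^{-1} \mathbf{y}. \tag{7.12}
$$

This solution is called the minimum-norm least-squares solution.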


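Proof. The proof of the theorem requires some knowledge of constrained optimization. Consider the Lagrangian of the problem:

$$
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \|\boldsymbol{\theta}\|^2 + \boldsymbol{\lambda}^T (\mathbf{X}\boldsymbol{\theta} - \mathbf{y}),
$$

where $\boldsymbol{\lambda}$ is called the Lagrange multiplier. The solution of the constrained optimization is the stationary point of the Lagrangian. To find the stationary point, we take the derivatives with respect to $\boldsymbol{\theta}$ and $\boldsymbol{\lambda}$. This yields

$$
\begin{aligned}
\nabla_{\boldsymbol{\theta}} \mathcal{L} &= 2\boldsymbol{\theta} + \mathbf{X}^T \boldsymbol{\lambda} = \mathbf{0}, \\
\nabla_{\boldsymbol{\lambda}} \mathcal{L} &= \mathbf{X}\boldsymbol{\theta} - \mathbf{y} = \mathbf{0}.
\end{aligned}
$$

The first equation gives us $\boldsymbol{\theta} = -\mathbf{X}^T \boldsymbol{\lambda}/2$. Substituting it into the second equation, and assuming that rank(X) = N so that $\mathbf{X}\mathbf{X}^T$ is invertible, we have

$$
\mathbf{X}\left(-\mathbf{X}^T \boldsymbol{\lambda}/2\right) - \mathbf{y} = \mathbf{0},
$$

which implies that $\boldsymbol{\lambda} = -2(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}$. Therefore, $\widehat{\boldsymbol{\theta}} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}$.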
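The closed form in Theorem 7.2 can be checked numerically. The following is a minimal sketch, assuming a randomly generated fat matrix (the sizes N = 5, d = 12 and the random data are illustrative choices, not values from the text). It computes the minimum-norm solution, builds a second feasible point by adding a null-space component, and compares the norms.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 12                      # fewer equations than unknowns (N < d); illustrative sizes
X = rng.standard_normal((N, d))   # a random Gaussian matrix has rank N almost surely
y = rng.standard_normal(N)

# Closed-form minimum-norm solution: theta = X^T (X X^T)^{-1} y.
# Solving the N x N system avoids forming the inverse explicitly.
theta_mn = X.T @ np.linalg.solve(X @ X.T, y)

# Any other feasible point equals theta_mn plus a vector in the null space of X.
z = rng.standard_normal(d)
null_part = z - X.T @ np.linalg.solve(X @ X.T, X @ z)   # projection of z onto null(X)
theta_other = theta_mn + null_part

print("constraint residual (min-norm):", np.linalg.norm(X @ theta_mn - y))
print("constraint residual (other):   ", np.linalg.norm(X @ theta_other - y))
print("norms:", np.linalg.norm(theta_mn), np.linalg.norm(theta_other))
```

Both points satisfy the constraint, but the one from Equation (7.12) has the smaller norm. When X has full row rank, the same vector is returned by the pseudo-inverse, `np.linalg.pinv(X) @ y`.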


The end of this subsection. Please join us again. 


7.1.5 Robust linear regression 


This subsection is optional for a first reading of the book. 


The linear regression we have discussed so far is based on an important criterion, 
namely the squared error criterion. We chose the squared error as the training loss because 
it is differentiable and convex. Differentiability allows us to take the derivative and locate 
the minimum point. Convexity allows us to claim a global minimizer (also unique if the 
objective function is strictly convex). However, such a nice criterion suffers from a serious 
drawback: the issue of outliers. 

Consider Figure 7.11. In Figure 7.11(a), we show a regression problem for N = 50 data points. Our basis functions are the ordinary polynomials up to the fourth order. Everything looks fine in the figure. We then perturb the data by randomly altering a few of the points so that their values are off. There are only a handful of these outliers. We run the same regression analysis again, but we observe (see Figure 7.11(b)) that our fitted curve has been distorted quite significantly.


[Figure: (a) (·)² without outlier; (b) (·)² with outlier. Each panel shows the data and the fitted curve.]


Figure 7.11: Linear regression using the squared error as the training loss suffers from outliers. (a) The 
regression performs well when there is no outlier. (b) By adding only a few outliers, the regression curve 
has already been distorted. 


This occurs because of the squared error. By the definition of a squared error, our 
training loss is 


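$$
\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \sum_{n=1}^{N} \left( y_n - g_{\boldsymbol{\theta}}(\mathbf{x}_n) \right)^2.
$$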


Without loss of generality, let us assume that one of these error terms is large because of an 
outlier. Then the training loss becomes 


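$$
\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \underbrace{\left( y_1 - g_{\boldsymbol{\theta}}(\mathbf{x}_1) \right)^2}_{\text{small}} + \underbrace{\left( y_2 - g_{\boldsymbol{\theta}}(\mathbf{x}_2) \right)^2}_{\text{small}} + \underbrace{\left( y_3 - g_{\boldsymbol{\theta}}(\mathbf{x}_3) \right)^2}_{\text{large}} + \cdots + \underbrace{\left( y_N - g_{\boldsymbol{\theta}}(\mathbf{x}_N) \right)^2}_{\text{small}}.
$$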


Here is the daunting fact: If one or a few of these individual error terms are large, the square operation will amplify them. As a result, the error you see is not just large but large². Moreover, since we put the squares to the small errors as well, we have small² instead of small. When you try to weigh the relative significance between the outliers and the normal data points, the outliers suddenly have a very large contribution to the error. Since the goal of linear regression is to minimize the total loss, the presence of the outliers will drive the optimization solution to compensate for the large error.

One possible solution is to replace the squared error by the absolute error, such that

$$
\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \sum_{n=1}^{N} \left| y_n - g_{\boldsymbol{\theta}}(\mathbf{x}_n) \right|.
$$


This is a simple modification, but it is very effective. The reason is that the absolute error 
keeps the small just as small, and keeps the large just as large. There is no amplification. 
Therefore, while the outliers still contribute to the overall loss, their contributions are less 
prominent. (If you have a lot of strong outliers, even the absolute error will fail. If this 
happens, you should go back to your data collection process and find out what has gone 
wrong.) 
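To make the amplification concrete, here is a small numerical sketch. The residual values are hypothetical (nine ordinary points and one outlier) and were chosen only for illustration. The code computes the fraction of the total loss that the single outlier is responsible for under each criterion.

```python
import numpy as np

# Hypothetical residuals y_n - g(x_n): nine ordinary points and one outlier (last entry).
residuals = np.array([0.1, -0.2, 0.15, -0.1, 0.05, 0.2, -0.15, 0.1, -0.05, 4.0])

squared = residuals**2           # per-sample squared error
absolute = np.abs(residuals)     # per-sample absolute error

# Fraction of the total loss contributed by the outlier alone.
share_sq = squared[-1] / squared.sum()
share_abs = absolute[-1] / absolute.sum()

print(f"outlier share of squared loss:  {share_sq:.3f}")
print(f"outlier share of absolute loss: {share_abs:.3f}")
```

Under the squared error the outlier accounts for nearly the entire loss, whereas under the absolute error the ordinary points retain a visible share. This is why the absolute error pulls the fitted curve toward the outlier less strongly.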

When we use the absolute error as the training loss, the resulting regression problem is 
the least absolute deviation regression (or simply the robust regression). The tricky thing 
about the least absolute deviation is that the training loss is not differentiable. In other 
words, we cannot take the derivative and find the optimal solution. The good news is that 
there exists an alternative approach for solving this problem: using linear programming 
(implemented via the simplex method). 


Solving the robust regression problem 
Let us focus on the linear model 


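$$
g_{\boldsymbol{\theta}}(\mathbf{x}_n) = \mathbf{x}_n^T \boldsymbol{\theta},
$$

where $\mathbf{x}_n = [\phi_0(x_n), \ldots, \phi_{d-1}(x_n)]^T \in \mathbb{R}^d$ is the nth input vector for some basis functions $\{\phi_p\}_{p=0}^{d-1}$ and $\boldsymbol{\theta} = [\theta_0, \ldots, \theta_{d-1}]^T \in \mathbb{R}^d$ is the parameter. Substituting this into the training loss, the optimization problem is

$$
\underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\text{minimize}} \;\; \sum_{n=1}^{N} \left| y_n - \mathbf{x}_n^T \boldsymbol{\theta} \right|.
$$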


Here is an important trick. The idea is to express the problem as an equivalent problem 


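$$
\begin{aligned}
\underset{\boldsymbol{\theta} \in \mathbb{R}^d,\, \mathbf{u} \in \mathbb{R}^N}{\text{minimize}} \quad & \sum_{n=1}^{N} u_n \\
\text{subject to} \quad & u_n = \left| y_n - \mathbf{x}_n^T \boldsymbol{\theta} \right|, \quad n = 1, \ldots, N.
\end{aligned}
$$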


There is a small but important difference between this problem and the previous one. In the first problem, there is only one optimization variable $\boldsymbol{\theta}$. In the new problem, we introduce an additional variable $\mathbf{u} = [u_1, \ldots, u_N]^T$ and add a constraint $u_n = |y_n - \mathbf{x}_n^T \boldsymbol{\theta}|$ for $n = 1, \ldots, N$. We introduce $\mathbf{u}$ so that we can have some additional degrees of freedom. At the optimal solution, $u_n$ must equal $|y_n - \mathbf{x}_n^T \boldsymbol{\theta}|$, and so the corresponding $\boldsymbol{\theta}$ is the solution of the original problem.

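Now we note that $x \ge |a|$ is equivalent to $x \ge a$ and $x \ge -a$. Because we are minimizing the sum of the $u_n$, each $u_n$ is pushed down until it equals $|y_n - \mathbf{x}_n^T \boldsymbol{\theta}|$ at the optimum, so replacing the equality by this pair of inequalities does not change the solution. Therefore, the problem can be equivalently written as

$$
\begin{aligned}
\underset{\boldsymbol{\theta} \in \mathbb{R}^d,\, \mathbf{u} \in \mathbb{R}^N}{\text{minimize}} \quad & \sum_{n=1}^{N} u_n \\
\text{subject to} \quad & u_n \ge -(y_n - \mathbf{x}_n^T \boldsymbol{\theta}), \quad n = 1, \ldots, N, \\
& u_n \ge (y_n - \mathbf{x}_n^T \boldsymbol{\theta}), \quad n = 1, \ldots, N.
\end{aligned} \tag{7.13}
$$

In other words, we have rewritten the equality constraint as a pair of inequality constraints by removing the absolute-value signs.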
The optimization in Equation (7.13) is in the form of a standard linear programming problem. A linear programming problem takes the form of

$$
\begin{aligned}
\underset{\mathbf{x} \in \mathbb{R}^k}{\text{minimize}} \quad & \mathbf{c}^T \mathbf{x} \\
\text{subject to} \quad & \mathbf{A}\mathbf{x} \le \mathbf{b},
\end{aligned} \tag{7.14}
$$

for some vectors $\mathbf{c} \in \mathbb{R}^k$, $\mathbf{b} \in \mathbb{R}^m$, and matrix $\mathbf{A} \in \mathbb{R}^{m \times k}$. Linear programming is a standard optimization problem that you can find in most optimization textbooks. On a computer, if we know c, b and A, solving the linear programming problem can be done using built-in commands. For MATLAB, the command is linprog. For Python, the command is scipy.optimize.linprog. We will discuss a concrete example shortly.


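% MATLAB command for linear programming
x = linprog(c, A, b);


# Python command for linear programming
# (method="highs" is used here because the legacy "revised simplex" option
#  has been removed from recent SciPy releases)
res = linprog(c, A, b, bounds=(None,None), method="highs")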


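Given Equation (7.13), the question becomes how to convert it into the standard linear programming format. This requires two steps. The first step uses the objective function:

$$
\sum_{n=1}^{N} u_n = \sum_{p=0}^{d-1} (0)(\theta_p) + \sum_{n=1}^{N} (1)(u_n)
= \begin{bmatrix} 0 & \cdots & 0 & 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} \boldsymbol{\theta} \\ \mathbf{u} \end{bmatrix}.
$$

Therefore, the vector c has d 0's followed by N 1's.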
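The second step concerns the constraint. It can be shown that $u_n \ge -(y_n - \mathbf{x}_n^T \boldsymbol{\theta})$ is equivalent to $\mathbf{x}_n^T \boldsymbol{\theta} - u_n \le y_n$. Written in the matrix form, we have

$$
\begin{bmatrix}
\mathbf{x}_1^T & -1 & 0 & \cdots & 0 \\
\mathbf{x}_2^T & 0 & -1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathbf{x}_N^T & 0 & 0 & \cdots & -1
\end{bmatrix}
\begin{bmatrix} \boldsymbol{\theta} \\ u_1 \\ \vdots \\ u_N \end{bmatrix}
\le
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix},
$$

which is equivalent to

$$
\begin{bmatrix} \mathbf{X} & -\mathbf{I} \end{bmatrix} \begin{bmatrix} \boldsymbol{\theta} \\ \mathbf{u} \end{bmatrix} \le \mathbf{y}, \tag{7.15}
$$

where $\mathbf{I} \in \mathbb{R}^{N \times N}$ is the identity matrix.

Similarly, the other constraint $u_n \ge (y_n - \mathbf{x}_n^T \boldsymbol{\theta})$ is equivalent to $-\mathbf{x}_n^T \boldsymbol{\theta} - u_n \le -y_n$. Written in the matrix form, we have

$$
\begin{bmatrix}
-\mathbf{x}_1^T & -1 & 0 & \cdots & 0 \\
-\mathbf{x}_2^T & 0 & -1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-\mathbf{x}_N^T & 0 & 0 & \cdots & -1
\end{bmatrix}
\begin{bmatrix} \boldsymbol{\theta} \\ u_1 \\ \vdots \\ u_N \end{bmatrix}
\le
\begin{bmatrix} -y_1 \\ -y_2 \\ \vdots \\ -y_N \end{bmatrix},
$$

which is equivalent to

$$
\begin{bmatrix} -\mathbf{X} & -\mathbf{I} \end{bmatrix} \begin{bmatrix} \boldsymbol{\theta} \\ \mathbf{u} \end{bmatrix} \le -\mathbf{y}.
$$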
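Putting everything together, we have finally arrived at the linear programming problem

$$
\begin{aligned}
\underset{\boldsymbol{\theta} \in \mathbb{R}^d,\, \mathbf{u} \in \mathbb{R}^N}{\text{minimize}} \quad &
\begin{bmatrix} \mathbf{0}_d^T & \mathbf{1}_N^T \end{bmatrix} \begin{bmatrix} \boldsymbol{\theta} \\ \mathbf{u} \end{bmatrix} \\
\text{subject to} \quad &
\begin{bmatrix} \mathbf{X} & -\mathbf{I} \\ -\mathbf{X} & -\mathbf{I} \end{bmatrix} \begin{bmatrix} \boldsymbol{\theta} \\ \mathbf{u} \end{bmatrix} \le \begin{bmatrix} \mathbf{y} \\ -\mathbf{y} \end{bmatrix},
\end{aligned}
$$

where $\mathbf{0}_d \in \mathbb{R}^d$ is an all-zero vector, and $\mathbf{1}_N \in \mathbb{R}^N$ is an all-one vector. It is this problem that solves the robust linear regression.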

Let us look at how to implement linear programming to solve the robust regression 
optimization. As an example, we continue with the polynomial fitting problem in which 
there are outliers. We choose the ordinary polynomials as the basis functions. To construct 
the linear programming problem, we need to define the matrix A and the vectors c and 
b according to the linear programming form. This is done using the following MATLAB 
program. 


% MATLAB code to demonstrate robust regression
N = 50;
x = linspace(-1,1,N)';
a = [-0.001 0.01 0.55 1.5 1.2];
y = a(1)*legendreP(0,x) + a(2)*legendreP(1,x) + ...
    a(3)*legendreP(2,x) + a(4)*legendreP(3,x) + ...
    a(5)*legendreP(4,x) + 0.2*randn(N,1);
idx = [10, 16, 23, 37, 45];
y(idx) = 5;                                  % plant the outliers

X = [x(:).^0 x(:).^1 x(:).^2 x(:).^3 x(:).^4];
A = [X -eye(N); -X -eye(N)];
b = [y(:); -y(:)];
c = [zeros(1,5) ones(1,N)]';
theta = linprog(c, A, b);                    % theta stacks [theta; u]

t = linspace(-1,1,200)';
yhat = theta(1) + theta(2)*t(:) + ...
       theta(3)*t(:).^2 + theta(4)*t(:).^3 + ...
       theta(5)*t(:).^4;
plot(x,y,'ko','LineWidth',2); hold on;
plot(t,yhat,'r','LineWidth',4);

In this set of commands, the basis vectors are defined as $\mathbf{x}_n = [\phi_0(x_n), \ldots, \phi_4(x_n)]^T$ with $\phi_p(x) = x^p$, for $n = 1, \ldots, N$. The matrix $\mathbf{I}$ is constructed by using the command eye(N), which constructs the identity matrix of size N × N. The rest of the commands are self-explanatory. Note that the solution to the linear programming problem consists of both $\boldsymbol{\theta}$ and $\mathbf{u}$. To extract $\boldsymbol{\theta}$ we need to take the first d entries. The remainder is $\mathbf{u}$.

Commands for Python are similar, although we need to call np.hstack and np.vstack to construct the matrices and vectors. The main routine is linprog in the scipy.optimize library. Note that for this particular example, the bounds are bounds=(None,None); otherwise linprog restricts every variable to be nonnegative and will search only in the positive orthant.


# Python code to demonstrate robust regression
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import eval_legendre
from scipy.optimize import linprog

N = 50
x = np.linspace(-1,1,N)
a = np.array([-0.001, 0.01, 0.55, 1.5, 1.2])
y = a[0]*eval_legendre(0,x) + a[1]*eval_legendre(1,x) + \
    a[2]*eval_legendre(2,x) + a[3]*eval_legendre(3,x) + \
    a[4]*eval_legendre(4,x) + 0.2*np.random.randn(N)
idx = [10,16,23,37,45]
y[idx] = 5                                   # plant the outliers
X = np.column_stack((np.ones(N), x, x**2, x**3, x**4))
A = np.vstack((np.hstack((X, -np.eye(N))), np.hstack((-X, -np.eye(N)))))
b = np.hstack((y,-y))
c = np.hstack((np.zeros(5), np.ones(N)))

# method="highs" is used because the legacy "revised simplex" option
# has been removed from recent SciPy releases
res = linprog(c, A, b, bounds=(None,None), method="highs")
theta = res.x                                # first 5 entries: theta; remaining N entries: u

t = np.linspace(-1,1,200)
yhat = theta[0]*np.ones(200) + theta[1]*t + theta[2]*t**2 + \
       theta[3]*t**3 + theta[4]*t**4
plt.plot(x,y,'o',markersize=12)
plt.plot(t,yhat,linewidth=8)
plt.show()


The result of this experiment is shown in Figure 7.12. It is remarkable to see that the 
robust regression result is almost as good as the result would be without outliers. 

If robust linear regression performs so well, why don't we use it all the time? Why is least squares regression still more popular? The answer has a lot to do with the computational complexity and the uniqueness of the solution. Linear programming has no closed-form solution; it must be solved by an iterative algorithm. While we have very fast linear programming solvers today, the computational cost of solving a linear program is still much higher than solving a least-squares problem (which is essentially a matrix inversion).

The other issue with robust linear regression is the uniqueness of the solution. Linear programming is known to have nonunique solutions when a level set of the objective function (which is a hyperplane) touches the constraint set (a high-dimensional polytope) along one of its edges or faces rather than at a single vertex. The least-squares fitting does not have this problem because the optimization surface is a paraboloid. Unless the matrix $\mathbf{X}^T\mathbf{X}$ is noninvertible, the solution is guaranteed to be the unique global minimum. Linear programming does not have this convenient property. We can have multiple solutions $\boldsymbol{\theta}$ that give the same objective value. If you try to interpret your result by inspecting the magnitude of the $\theta$'s, the nonuniqueness of the solution would cause problems because your interpretation can be invalidated immediately if the linear programming gives you a nonunique solution.

[Figure: (a) Ordinary (·)² regression with outliers; (b) Robust |·| regression with outliers.]


Figure 7.12: (a) Ordinary linear regression using (·)² as the training loss. In the absence of any outlier, the regression performs well. (b) Robust linear regression using |·| as the training loss. Note that even in the presence of outliers, the robust regression performs reasonably well.
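The nonuniqueness of the absolute-error solution can be seen in a very small example. The sketch below is a hypothetical toy problem, not one from the text: a constant model g(x) = θ is fitted to two observations, and the two training losses are evaluated at a few candidate values of θ.

```python
import numpy as np

# Hypothetical toy problem: fit a constant model g(x) = theta to two observations.
y = np.array([0.0, 1.0])

def abs_loss(theta):
    """Sum of absolute errors for the constant model."""
    return np.sum(np.abs(y - theta))

def sq_loss(theta):
    """Sum of squared errors for the constant model."""
    return np.sum((y - theta) ** 2)

candidates = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
abs_values = np.array([abs_loss(t) for t in candidates])
sq_values = np.array([sq_loss(t) for t in candidates])

print("theta     :", candidates)
print("abs. loss :", abs_values)   # the same value for every theta between the two observations
print("sq. loss  :", sq_values)    # a single smallest value, at the midpoint
```

Every θ between the two observations attains the same absolute loss, so a solver may return any of them. The squared loss, in contrast, has exactly one minimizer.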


End of this subsection. Please join us again. 


Closing remark. The principle of linear regression is primarily to set up a function to fit 
the data. This, in turn, is about finding a set of good basis functions and minimizing the 
appropriate training loss. Selecting the basis is usually done in several ways: 


• The problem forces you to choose certain basis functions. For example, suppose you are working on a disease dataset. The variates are height, weight, and BMI. You do not have any choice here because your goal is to see which factor contributes the most to the cause of the disease.


• There are known basis functions that work. For example, suppose you are working on a speech dataset. Physics tells us that Fourier bases are excellent representations of these sinusoidal functions. So it would make more sense to use the Fourier basis than the polynomials.


• Sometimes the basis can be learned from the data. For example, you can run principal-component analysis (PCA) to extract the basis. Then you can run the linear regression to compute the coefficients. This is a data-driven approach and could apply to some problems.


7.2 Overfitting 


The regression principle we have discussed in the previous section is a powerful technique 
for data analysis. However, there are many ways in which things can fall apart. We have 
seen the problem of outliers, where perturbations of one or a few data points would result 
in a big change in the regression result, and we discussed some techniques to overcome the 
outlier problem, e.g., using robust regression. In addition to outliers, there are other causes 
of the failure of the regression. 

In this section, we examine the relationship between the number of training samples 
and the complexity of the model. For example, if we decide to use polynomials as the basis 
functions and we have only N = 20 data points, what should be the order of the polynomials? 
Shall we use the 5th-order polynomial, or shall we use the 20th-order? Our goal in this section 
is to acquire an understanding of the general problem known as overfitting. Then we will 
discuss methods for mitigating overfitting in Section 7.4. 


7.2.1 Overview of overfitting 


Imagine that we have a dataset containing N = 20 training samples. We know that the data are generated from a fourth-order polynomial with Legendre polynomials as the basis. On top of these samples, we also know that a small amount of noise corrupts each sample, for example, Gaussian noise of standard deviation σ = 0.1.

We have two options here for fitting the data:


• Option 1: $h(x) = \sum_{p=0}^{4} \theta_p L_p(x)$, which is a 4th-order polynomial.
• Option 2: $g(x) = \sum_{p=0}^{50} \theta_p L_p(x)$, which is a 50th-order polynomial.


Model 2 is more expressive because it has more degrees of freedom. Let us fit the data using 
these two models. Figure 7.13 shows the results. However, what is going on with the 50th- 
order polynomial? It has gone completely wild. How can the regression ever choose such a 
terrible model? 


[Figure: (a) 4th-order polynomial; (b) 50th-order polynomial. Each panel shows the data and the fitted curve.]


Figure 7.13: Fitting data using a 4th-order polynomial and a 50th-order polynomial. 


Here is an even bigger surprise: If we compute the training loss, we get

$$
\begin{aligned}
\mathcal{E}_{\text{train}}(h) &= \frac{1}{N} \sum_{n=1}^{N} \left( y_n - h(x_n) \right)^2 = 0.0063, \\
\mathcal{E}_{\text{train}}(g) &= \frac{1}{N} \sum_{n=1}^{N} \left( y_n - g(x_n) \right)^2 = 5.7811 \times 10^{-24}.
\end{aligned}
$$

Thus, while Model 2 looks wild in the figure, it has a much lower training loss than Model 1. So according to the training loss, Model 2 fits better.

Any sensible person at this point will object, since Model 2 cannot possibly be better, for the following reason. It is not because it “looks bad”, but because if you test the model with an unseen sample it is almost certain that the testing error will explode. For example, in Figure 7.13(a) if we look at x = 0, we would expect the predicted value to be close to y = 0. However, Figure 7.13(b) suggests that the predicted value is going to negative infinity. It would be hard to believe that the negative infinity is a better prediction than the other one. We refer to this general phenomenon of fitting very well to the training data but generalizing poorly to the testing data as overfitting.
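The comparison between the two models can be reproduced with a few lines of code. The following is a sketch under stated assumptions: the ground-truth Legendre coefficients `a`, the random seed, and the use of `np.linalg.lstsq` are illustrative choices and not necessarily how the figure was produced. In particular, when the model has more coefficients than samples, `lstsq` returns the minimum-norm solution, which may look tamer than the curve in Figure 7.13(b). The exact numbers depend on the noise realization.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
N, sigma = 20, 0.1
a = np.array([-0.001, 0.01, 0.55, 1.5, 1.2])   # assumed ground-truth Legendre coefficients

x = np.linspace(-1, 1, N)
y = legendre.legval(x, a) + sigma * rng.standard_normal(N)

def fit_and_train_loss(order):
    """Least-squares fit in the Legendre basis; returns coefficients and training loss."""
    X = legendre.legvander(x, order)               # N x (order+1) matrix of L_p(x_n)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimum-norm solution if order+1 > N
    return theta, np.mean((X @ theta - y) ** 2)

theta_h, train_h = fit_and_train_loss(4)    # Option 1
theta_g, train_g = fit_and_train_loss(50)   # Option 2

# Fresh inputs and fresh noise play the role of unseen testing samples.
x_new = rng.uniform(-1, 1, 1000)
y_new = legendre.legval(x_new, a) + sigma * rng.standard_normal(1000)
test_h = np.mean((legendre.legval(x_new, theta_h) - y_new) ** 2)
test_g = np.mean((legendre.legval(x_new, theta_g) - y_new) ** 2)

print(f"4th order : train {train_h:.2e}, test {test_h:.2e}")
print(f"50th order: train {train_g:.2e}, test {test_g:.2e}")
```

The quantity to watch is the gap between the training loss and the testing loss of each model, rather than the training loss alone.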


What is overfitting?
Overfitting means that a model fits too closely to the training samples so that it fails to generalize.


Overfitting occurs as a consequence of an imbalance between the following three factors: 


• Number of training samples N. If you have many training samples, you should learn very well, even if the model is complex. Conversely, if the model is complex but does not have enough training samples, you will overfit it. The most serious problem in regression is often insufficient training data.


• Model order d. This refers to the complexity of the model. For example, if your model uses a polynomial, d refers to the order of the polynomial. If your training set is too small, you need to use a less complex model. The general rule of thumb is: “less is more”.


• Noise variance σ². This refers to the variance of the error $e_n$ you add to the data. The model we assumed in the previous numerical experiment is that

$$
y_n = g(x_n) + e_n, \quad n = 1, \ldots, N,
$$

where $e_n \sim \text{Gaussian}(0, \sigma^2)$. If σ increases, it is inevitable that the fitting will become more difficult. Hence it would require more training samples, and perhaps a less complex model would work better.


7.2.2 Analysis of the linear case 


Let us spell out the details of these factors one by one. To make our discussion concrete, we 
will use linear regression as a case study. The general analysis will be presented in the next 
section. 


Notations 


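• Ground Truth Model. To start with, we assume that we have a population set $\mathcal{D}$ containing infinitely many samples $(\mathbf{x}, y)$ drawn according to some latent distributions. The relationship between $\mathbf{x}$ and $y$ is defined through an unknown target function

$$
y = f(\mathbf{x}) + e,
$$

where $e \sim \text{Gaussian}(0, \sigma^2)$ is the noise. For our analysis, we assume that $f(\cdot)$ is linear, so that

$$
f(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\theta}^*,
$$

where $\boldsymbol{\theta}^* \in \mathbb{R}^d$ is the ground truth model parameter. Notice that $f(\cdot)$ is deterministic, but $e$ is random. Therefore, any randomness we see in $y$ is due to $e$.


• Training and Testing Set. From $\mathcal{D}$, we construct two datasets: the training dataset $\mathcal{D}_{\text{train}}$ that contains training samples $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$ and the testing dataset $\mathcal{D}_{\text{test}}$ that contains $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_M, y_M)\}$. Both $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ are subsets of $\mathcal{D}$.


• Predictive Model. We consider a predictive model $g_{\boldsymbol{\theta}}(\cdot)$. For simplicity, we assume that $g_{\boldsymbol{\theta}}(\cdot)$ is also linear:

$$
g_{\boldsymbol{\theta}}(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\theta}.
$$

Given the training dataset $\mathcal{D}_{\text{train}} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, we construct a linear regression problem:

$$
\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\operatorname{argmin}} \;\; \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2.
$$

Throughout our analysis, we assume that N > d so that we have more training data than the number of unknowns. We further assume that $\mathbf{X}^T\mathbf{X}$ is invertible, and so there is a unique global minimizer

$$
\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.
$$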


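• Training Error. Given the estimated model parameter $\widehat{\boldsymbol{\theta}}$, we define the in-sample prediction as

$$
\widehat{\mathbf{y}}_{\text{train}} = \mathbf{X}_{\text{train}} \widehat{\boldsymbol{\theta}},
$$

where $\mathbf{X}_{\text{train}} = \mathbf{X}$ is the training data matrix. The in-sample prediction is the predicted value using the trained model for the training data. The corresponding error with respect to the ground truth is called the training error:

$$
\mathcal{E}_{\text{train}}(\widehat{\boldsymbol{\theta}}) = \mathbb{E}_{\mathbf{e}}\left[ \frac{1}{N} \|\widehat{\mathbf{y}}_{\text{train}} - \mathbf{y}\|^2 \right],
$$

where N is the number of training samples in the training dataset. Note that the expectation is taken with respect to the noise vector $\mathbf{e}$, which follows the distribution Gaussian$(\mathbf{0}, \sigma^2 \mathbf{I})$.


• Testing Error. During testing, we construct a testing matrix $\mathbf{X}_{\text{test}}$. This gives us the estimated values $\widehat{\mathbf{y}}_{\text{test}}$:

$$
\widehat{\mathbf{y}}_{\text{test}} = \mathbf{X}_{\text{test}} \widehat{\boldsymbol{\theta}}.
$$

The out-sample prediction is the predicted value using the trained model for the testing data. The corresponding error with respect to the ground truth is called the testing error:

$$
\mathcal{E}_{\text{test}}(\widehat{\boldsymbol{\theta}}) = \mathbb{E}_{\mathbf{e}, \mathbf{e}'}\left[ \frac{1}{M} \|\widehat{\mathbf{y}}_{\text{test}} - \mathbf{y}'\|^2 \right],
$$

where $\mathbf{y}'$ is the vector of testing outputs, with its own noise $\mathbf{e}'$, and M is the number of testing samples in the testing dataset.


Analysis of the training error


We first analyze the training error, which we defined as

$$
\mathcal{E}_{\text{train}} = \mathbb{E}_{\mathbf{e}}\left[ \frac{1}{N} \|\widehat{\mathbf{y}} - \mathbf{y}\|^2 \right] \overset{\text{def}}{=} \text{MSE}(\widehat{\mathbf{y}}, \mathbf{y}). \tag{7.16}
$$

For this particular choice of the training error, we call it the mean squared error (MSE). It measures the difference between $\widehat{\mathbf{y}}$ and $\mathbf{y}$.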


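Theorem 7.3. Let $\boldsymbol{\theta}^* \in \mathbb{R}^d$ be the ground truth linear model parameter, and $\mathbf{X} \in \mathbb{R}^{N \times d}$ be a matrix such that N > d and $\mathbf{X}^T\mathbf{X}$ is invertible. Assume that the data follows the linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\theta}^* + \mathbf{e}$ where $\mathbf{e} \sim \text{Gaussian}(\mathbf{0}, \sigma^2 \mathbf{I})$. Consider the linear regression problem $\widehat{\boldsymbol{\theta}} = \operatorname{argmin}_{\boldsymbol{\theta} \in \mathbb{R}^d} \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2$, and the predicted value $\widehat{\mathbf{y}} = \mathbf{X}\widehat{\boldsymbol{\theta}}$. The mean squared training error of this linear model is

$$
\mathcal{E}_{\text{train}} \overset{\text{def}}{=} \text{MSE}(\widehat{\mathbf{y}}, \mathbf{y}) = \mathbb{E}_{\mathbf{e}}\left[ \frac{1}{N} \|\widehat{\mathbf{y}} - \mathbf{y}\|^2 \right] = \sigma^2 \left( 1 - \frac{d}{N} \right). \tag{7.17}
$$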


The proof below depends on some results from linear algebra that may be difficult for 
first-time readers. We recommend you read the proof later. 


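Proof. Recall that the least squares linear regression solution is $\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$. Since $\mathbf{y} = \mathbf{X}\boldsymbol{\theta}^* + \mathbf{e}$, we can substitute this into the predicted value $\widehat{\mathbf{y}}$ to show that

$$
\widehat{\mathbf{y}} = \mathbf{X}\widehat{\boldsymbol{\theta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \underbrace{\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T}_{=\mathbf{H}} (\mathbf{X}\boldsymbol{\theta}^* + \mathbf{e}) = \mathbf{X}\boldsymbol{\theta}^* + \mathbf{H}\mathbf{e}.
$$

Therefore, substituting $\widehat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}^* + \mathbf{H}\mathbf{e}$ into the MSE,

$$
\begin{aligned}
\text{MSE}(\widehat{\mathbf{y}}, \mathbf{y}) &\overset{\text{def}}{=} \mathbb{E}_{\mathbf{e}}\left[ \frac{1}{N} \|\widehat{\mathbf{y}} - \mathbf{y}\|^2 \right] = \mathbb{E}_{\mathbf{e}}\left[ \frac{1}{N} \|\mathbf{X}\boldsymbol{\theta}^* + \mathbf{H}\mathbf{e} - \mathbf{X}\boldsymbol{\theta}^* - \mathbf{e}\|^2 \right] \\
&= \mathbb{E}_{\mathbf{e}}\left[ \frac{1}{N} \|(\mathbf{H} - \mathbf{I})\mathbf{e}\|^2 \right].
\end{aligned}
$$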


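At this point we need to use a tool from linear algebra. One useful identity³ is that for any $\mathbf{v} \in \mathbb{R}^N$,

$$
\|\mathbf{v}\|^2 = \text{Tr}(\mathbf{v}\mathbf{v}^T).
$$

³ The reason for this identity is that

$$
\|\mathbf{v}\|^2 = \sum_{n=1}^{N} v_n^2 = \text{Tr}\left\{ \begin{bmatrix} v_1^2 & v_1 v_2 & \cdots & v_1 v_N \\ v_2 v_1 & v_2^2 & \cdots & v_2 v_N \\ \vdots & \vdots & \ddots & \vdots \\ v_N v_1 & v_N v_2 & \cdots & v_N^2 \end{bmatrix} \right\} = \text{Tr}\left\{ \mathbf{v}\mathbf{v}^T \right\}.
$$

Using this identity, we have that

$$
\begin{aligned}
\mathbb{E}_{\mathbf{e}}\left[ \frac{1}{N} \|(\mathbf{H} - \mathbf{I})\mathbf{e}\|^2 \right]
&= \frac{1}{N} \mathbb{E}_{\mathbf{e}}\left[ \text{Tr}\left\{ (\mathbf{H} - \mathbf{I})\mathbf{e}\mathbf{e}^T(\mathbf{H} - \mathbf{I})^T \right\} \right] \\
&= \frac{1}{N} \text{Tr}\left\{ (\mathbf{H} - \mathbf{I})\, \mathbb{E}_{\mathbf{e}}\left[ \mathbf{e}\mathbf{e}^T \right] (\mathbf{H} - \mathbf{I})^T \right\} \\
&= \frac{\sigma^2}{N} \text{Tr}\left\{ (\mathbf{H} - \mathbf{I})(\mathbf{H} - \mathbf{I})^T \right\},
\end{aligned}
$$

where we used the fact that $\mathbb{E}[\mathbf{e}\mathbf{e}^T] = \sigma^2 \mathbf{I}$. The special structure of $\mathbf{H}$ tells us that $\mathbf{H}^T = \mathbf{H}$ and $\mathbf{H}^T\mathbf{H} = \mathbf{H}$. Thus, we have $(\mathbf{H} - \mathbf{I})^T(\mathbf{H} - \mathbf{I}) = \mathbf{I} - \mathbf{H}$. In addition, using the cyclic property of trace $\text{Tr}(\mathbf{A}\mathbf{B}) = \text{Tr}(\mathbf{B}\mathbf{A})$, we have that

$$
\text{Tr}(\mathbf{H}) = \text{Tr}(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T) = \text{Tr}((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}) = \text{Tr}(\mathbf{I}_{d \times d}) = d.
$$

Consequently,

$$
\text{MSE}(\widehat{\mathbf{y}}, \mathbf{y}) = \frac{\sigma^2}{N} \text{Tr}\left\{ \mathbf{I} - \mathbf{H} \right\} = \frac{\sigma^2}{N}(N - d) = \sigma^2 \left( 1 - \frac{d}{N} \right).
$$

This completes the proof.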


The end of the proof. Please join us again. 
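The properties of the matrix H used in the proof can be verified numerically. The sketch below assumes a randomly generated design matrix with N = 30 and d = 4 (illustrative values only). It checks symmetry, idempotence, and the two traces that appear in the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 30, 4                              # illustrative sizes with N > d
X = rng.standard_normal((N, d))           # assumed full column rank, so X^T X is invertible

# H = X (X^T X)^{-1} X^T, formed via a linear solve instead of an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)
I = np.eye(N)

print("H symmetric:       ", np.allclose(H, H.T))
print("H idempotent:      ", np.allclose(H @ H, H))
print("Tr(H):             ", np.trace(H))
print("Tr((H-I)^T (H-I)): ", np.trace((H - I).T @ (H - I)))
```

The last trace equals N − d, which is exactly the factor that turns σ²/N into σ²(1 − d/N) in Theorem 7.3.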


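Practice Exercise 1. In the theorem above, we proved the MSE of the prediction $\widehat{\mathbf{y}}$. In this example, we would like to prove the MSE for the parameter. Prove that

$$
\text{MSE}(\widehat{\boldsymbol{\theta}}, \boldsymbol{\theta}^*) \overset{\text{def}}{=} \mathbb{E}_{\mathbf{e}}\left[ \|\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}^*\|^2 \right] = \sigma^2\, \text{Tr}\left\{ (\mathbf{X}^T\mathbf{X})^{-1} \right\}.
$$

Solution. Let us start with the definition:

$$
\begin{aligned}
\text{MSE}(\widehat{\boldsymbol{\theta}}, \boldsymbol{\theta}^*)
&= \mathbb{E}_{\mathbf{e}}\left[ \left\| (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} - \boldsymbol{\theta}^* \right\|^2 \right] \\
&= \mathbb{E}_{\mathbf{e}}\left[ \left\| (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta}^* + \mathbf{e}) - \boldsymbol{\theta}^* \right\|^2 \right] \\
&= \mathbb{E}_{\mathbf{e}}\left[ \left\| \boldsymbol{\theta}^* + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e} - \boldsymbol{\theta}^* \right\|^2 \right] \\
&= \mathbb{E}_{\mathbf{e}}\left[ \left\| (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e} \right\|^2 \right].
\end{aligned}
$$

Continuing the calculation,

$$
\begin{aligned}
\mathbb{E}_{\mathbf{e}}\left[ \left\| (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e} \right\|^2 \right]
&= \mathbb{E}_{\mathbf{e}}\left[ \text{Tr}\left\{ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e}\mathbf{e}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \right\} \right] \\
&= \text{Tr}\left\{ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\, \mathbb{E}_{\mathbf{e}}\left[ \mathbf{e}\mathbf{e}^T \right] \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \right\} \\
&= \text{Tr}\left\{ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \cdot \sigma^2\mathbf{I} \cdot \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \right\} \\
&= \sigma^2\, \text{Tr}\left\{ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \right\} = \sigma^2\, \text{Tr}\left\{ (\mathbf{X}^T\mathbf{X})^{-1} \right\}.
\end{aligned}
$$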
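The result of the exercise can be checked by simulation. The following is a minimal Monte Carlo sketch; the design matrix, the ground-truth parameter, the noise level, and the number of trials are all hypothetical choices made for illustration. It estimates the parameter MSE empirically and compares it with σ² Tr{(XᵀX)⁻¹}.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 40, 3, 0.5
X = rng.standard_normal((N, d))            # fixed design matrix (hypothetical)
theta_star = np.array([1.0, -2.0, 0.5])    # hypothetical ground truth
trials = 20000

# Draw all noise realizations at once: each column of E is one noise vector e.
E = sigma * rng.standard_normal((N, trials))
Y = (X @ theta_star)[:, None] + E

# Least-squares estimate for every trial; each column of Theta_hat is one estimate.
Theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

mse_empirical = np.mean(np.sum((Theta_hat - theta_star[:, None]) ** 2, axis=0))
mse_theory = sigma**2 * np.trace(np.linalg.inv(X.T @ X))
print("empirical:", mse_empirical)
print("theory:   ", mse_theory)
```

The two numbers should agree up to Monte Carlo error, which shrinks as the number of trials grows.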


Analysis of the testing error 


Similarly to the training error, we can analyze the testing error. The testing error is defined as

$$
\mathcal{E}_{\text{test}} = \text{MSE}(\widehat{\mathbf{y}}, \mathbf{y}') \overset{\text{def}}{=} \mathbb{E}_{\mathbf{e}, \mathbf{e}'}\left[ \frac{1}{M} \|\widehat{\mathbf{y}} - \mathbf{y}'\|^2 \right], \tag{7.18}
$$

where $\widehat{\mathbf{y}} = [\widehat{y}_1, \ldots, \widehat{y}_M]^T$ is a vector of M predicted values and $\mathbf{y}' = [y_1', \ldots, y_M']^T$ is a vector of M true values in the testing data.⁴

We would like to derive something concrete. To make our analysis simple, we consider a special case in which the testing set contains $(\mathbf{x}_1, y_1'), \ldots, (\mathbf{x}_N, y_N')$. That is, the inputs $\mathbf{x}_1, \ldots, \mathbf{x}_N$ are identical for both training and testing (for example, suppose that you measure the temperature on two different days but at the same time stamps). In this case, we have M = N, and we have $\mathbf{X}_{\text{test}} = \mathbf{X}_{\text{train}} = \mathbf{X}$. However, the noise added to the testing data is still different from the noise added to the training data.

With these simplifications, we can derive the testing error as follows. 


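Theorem 7.4. Let $\boldsymbol{\theta}^* \in \mathbb{R}^d$ be the ground truth linear model parameter, and $\mathbf{X} \in \mathbb{R}^{N \times d}$ be a matrix such that N > d and $\mathbf{X}^T\mathbf{X}$ is invertible. Assume that the training data follows the linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\theta}^* + \mathbf{e}$, where $\mathbf{e} \sim \text{Gaussian}(\mathbf{0}, \sigma^2\mathbf{I})$. Consider the linear regression solution $\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, and let $\widehat{\mathbf{y}} = \mathbf{X}\widehat{\boldsymbol{\theta}}$. Let $\mathbf{X}_{\text{test}} = \mathbf{X}$ be the testing input data matrix, and define $\mathbf{y}' = \mathbf{X}_{\text{test}}\boldsymbol{\theta}^* + \mathbf{e}' \in \mathbb{R}^N$, with $\mathbf{e}' \sim \text{Gaussian}(\mathbf{0}, \sigma^2\mathbf{I})$, be the testing output. Then, the mean squared testing error of this linear model is

$$
\mathcal{E}_{\text{test}} \overset{\text{def}}{=} \text{MSE}(\widehat{\mathbf{y}}, \mathbf{y}') = \mathbb{E}_{\mathbf{e}, \mathbf{e}'}\left[ \frac{1}{N} \|\widehat{\mathbf{y}} - \mathbf{y}'\|^2 \right] = \sigma^2 \left( 1 + \frac{d}{N} \right). \tag{7.19}
$$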


In this definition, the expectation is taken with respect to a joint distribution of (e, e’). 
This is because, in testing, the trained model is based on y of which the randomness is e. 
However, the testing data is based on y’, where the randomness comes from e’. We assume 
that e and e’ are independent i.i.d. Gaussian vectors. 


⁴ In practice, the number of testing samples M can be much larger than the number of training samples N. This probably does not agree with your experience, in which the testing dataset is often much smaller than the training dataset. The reason for this paradox is that the practical testing data set is only a finite subset of all the possible testing samples available. So the “testing error” we compute in practice approximates the true testing error. If you want to compute the true testing error, you need a very large testing dataset.

As with the previous proof, we recommend you study this proof later.


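Proof. The MSE can be derived from the definition:

$$
\begin{aligned}
\text{MSE}(\widehat{\mathbf{y}}, \mathbf{y}') &= \mathbb{E}_{\mathbf{e}, \mathbf{e}'}\left[ \frac{1}{N} \|\widehat{\mathbf{y}} - \mathbf{y}'\|^2 \right] \\
&= \mathbb{E}_{\mathbf{e}, \mathbf{e}'}\left[ \frac{1}{N} \|\mathbf{X}\boldsymbol{\theta}^* + \mathbf{H}\mathbf{e} - \mathbf{X}\boldsymbol{\theta}^* - \mathbf{e}'\|^2 \right] \\
&= \frac{1}{N} \mathbb{E}_{\mathbf{e}, \mathbf{e}'}\left[ \|\mathbf{H}\mathbf{e} - \mathbf{e}'\|^2 \right].
\end{aligned}
$$

Since each noise term $e_n$ and $e_n'$ is an i.i.d. copy of the same Gaussian random variable, by using the fact that

$$
\text{Tr}(\mathbf{H}) = \text{Tr}(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T) = \text{Tr}((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}) = \text{Tr}(\mathbf{I}_{d \times d}) = d,
$$

we have that

$$
\begin{aligned}
\mathbb{E}_{\mathbf{e}, \mathbf{e}'}\left[ \|\mathbf{H}\mathbf{e} - \mathbf{e}'\|^2 \right]
&= \mathbb{E}_{\mathbf{e}}\left[ \|\mathbf{H}\mathbf{e}\|^2 \right] - \underbrace{\mathbb{E}_{\mathbf{e}, \mathbf{e}'}\left[ 2\mathbf{e}^T\mathbf{H}^T\mathbf{e}' \right]}_{=0} + \mathbb{E}_{\mathbf{e}'}\left[ \|\mathbf{e}'\|^2 \right] \\
&= \mathbb{E}_{\mathbf{e}}\left[ \text{Tr}\left\{ \mathbf{H}\mathbf{e}\mathbf{e}^T\mathbf{H}^T \right\} \right] + \mathbb{E}_{\mathbf{e}'}\left[ \text{Tr}\left\{ \mathbf{e}'\mathbf{e}'^T \right\} \right] \\
&= \text{Tr}\left\{ \mathbf{H}\, \mathbb{E}_{\mathbf{e}}\left[ \mathbf{e}\mathbf{e}^T \right] \mathbf{H}^T \right\} + \text{Tr}\left\{ \mathbb{E}_{\mathbf{e}'}\left[ \mathbf{e}'\mathbf{e}'^T \right] \right\} \\
&= \text{Tr}\left\{ \mathbf{H} \cdot \sigma^2\mathbf{I}_{N \times N} \cdot \mathbf{H}^T \right\} + \text{Tr}\left\{ \sigma^2\mathbf{I}_{N \times N} \right\} \\
&= \sigma^2\, \text{Tr}\left\{ \mathbf{H}\mathbf{H}^T \right\} + \text{Tr}\left\{ \sigma^2\mathbf{I}_{N \times N} \right\} \\
&= \sigma^2\, \text{Tr}(\mathbf{I}_{d \times d}) + \sigma^2\, \text{Tr}\left\{ \mathbf{I}_{N \times N} \right\} = \sigma^2 (d + N).
\end{aligned}
$$

Combining all the terms,

$$
\text{MSE}(\widehat{\mathbf{y}}, \mathbf{y}') = \mathbb{E}_{\mathbf{e}, \mathbf{e}'}\left[ \frac{1}{N} \|\widehat{\mathbf{y}} - \mathbf{y}'\|^2 \right] = \sigma^2 \left( 1 + \frac{d}{N} \right),
$$

which completes the proof.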


The end of the proof. 
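Theorems 7.3 and 7.4 can be checked together by simulation. The sketch below is a Monte Carlo experiment under illustrative assumptions (N = 25, d = 5, σ = 1, a random design matrix, and a random ground-truth parameter, none of which come from the text). The training and testing outputs share the same inputs but use independent noise, as in the special case analyzed above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 25, 5, 1.0
X = rng.standard_normal((N, d))           # hypothetical design matrix, shared by train and test
theta_star = rng.standard_normal(d)       # hypothetical ground truth
trials = 20000

H = X @ np.linalg.solve(X.T @ X, X.T)     # y_hat = X theta_hat = H y
signal = (X @ theta_star)[:, None]

Y_train = signal + sigma * rng.standard_normal((N, trials))   # noise e
Y_test = signal + sigma * rng.standard_normal((N, trials))    # independent noise e'
Y_hat = H @ Y_train

E_train = np.mean((Y_hat - Y_train) ** 2)   # average over the N samples and all trials
E_test = np.mean((Y_hat - Y_test) ** 2)

print(f"E_train: empirical {E_train:.4f}, theory {sigma**2 * (1 - d / N):.4f}")
print(f"E_test : empirical {E_test:.4f}, theory {sigma**2 * (1 + d / N):.4f}")
```

Note that the ground-truth parameter cancels out of both errors, which is why the theoretical values depend only on σ², d, and N.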


7.2.3 Interpreting the linear analysis results 


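Let us summarize the two main theorems. They state that, for N > d,

$$
\mathcal{E}_{\text{train}} \overset{\text{def}}{=} \text{MSE}(\widehat{\mathbf{y}}, \mathbf{y}) = \mathbb{E}_{\mathbf{e}}\left[ \frac{1}{N} \|\widehat{\mathbf{y}} - \mathbf{y}\|^2 \right] = \sigma^2 \left( 1 - \frac{d}{N} \right), \tag{7.20}
$$

$$
\mathcal{E}_{\text{test}} \overset{\text{def}}{=} \text{MSE}(\widehat{\mathbf{y}}, \mathbf{y}') = \mathbb{E}_{\mathbf{e}, \mathbf{e}'}\left[ \frac{1}{N} \|\widehat{\mathbf{y}} - \mathbf{y}'\|^2 \right] = \sigma^2 \left( 1 + \frac{d}{N} \right). \tag{7.21}
$$

This pair of equations tells us everything about the overfitting issue.


How do $\mathcal{E}_{\text{train}}$ and $\mathcal{E}_{\text{test}}$ change w.r.t. σ²?


• $\mathcal{E}_{\text{train}}$ ↑ as σ² ↑. Thus noisier data are harder to fit.


• $\mathcal{E}_{\text{test}}$ ↑ as σ² ↑. Thus a noisier model is more difficult to generalize.


The reasons for these results should be clear from the following equations:

$$
\mathcal{E}_{\text{train}} = \sigma^2 \left( 1 - \frac{d}{N} \right), \qquad
\mathcal{E}_{\text{test}} = \sigma^2 \left( 1 + \frac{d}{N} \right).
$$

As σ² increases, the training error $\mathcal{E}_{\text{train}}$ grows linearly w.r.t. σ². Since the training error measures how good your model is compared with the training data, a larger $\mathcal{E}_{\text{train}}$ means it is more difficult to fit. For the testing case, $\mathcal{E}_{\text{test}}$ also grows linearly w.r.t. σ². This implies that the model would be more difficult to generalize if the model were trained using noisier data.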


How do $\mathcal{E}_{\text{train}}$ and $\mathcal{E}_{\text{test}}$ change w.r.t. N?


• $\mathcal{E}_{\text{train}}$ ↑ as N ↑. Thus more training samples make fitting harder.


• $\mathcal{E}_{\text{test}}$ ↓ as N ↑. Thus more training samples improve generalization.


The reason for this should also be clear from the following equations:

$$
\mathcal{E}_{\text{train}} = \sigma^2 \left( 1 - \frac{d}{N} \right), \qquad
\mathcal{E}_{\text{test}} = \sigma^2 \left( 1 + \frac{d}{N} \right).
$$

As N increases, the model sees more training samples. The goal of the model is to minimize the error with all the training samples. Thus the more training samples we have, the harder it will be to make everyone happy, so the training error grows as N grows. For testing, if the model is trained with more samples it is more resilient to noise. Hence the generalization improves.


How do $\mathcal{E}_{\text{train}}$ and $\mathcal{E}_{\text{test}}$ change w.r.t. d?


• $\mathcal{E}_{\text{train}}$ ↓ as d ↑. Thus a more complex model makes fitting easier.


• $\mathcal{E}_{\text{test}}$ ↑ as d ↑. Thus a more complex model makes generalization harder.


These results are perhaps less obvious than the others. The following equations tell us that

$$
\mathcal{E}_{\text{train}} = \sigma^2 \left( 1 - \frac{d}{N} \right), \qquad
\mathcal{E}_{\text{test}} = \sigma^2 \left( 1 + \frac{d}{N} \right). \tag{7.22}
$$

For this linear regression model to work, d has to be less than N; otherwise, the matrix inversion $(\mathbf{X}^T\mathbf{X})^{-1}$ is invalid. However, as d grows while N remains fixed, we ask the linear regression to fit a larger and larger model while not providing any additional training samples. Equation (7.22) says that $\mathcal{E}_{\text{train}}$ will drop as d increases but $\mathcal{E}_{\text{test}}$ will increase as d increases. Therefore, a larger model will not generalize as well if N is fixed.
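The dependence on d can be tabulated directly from the two formulas. The following short sketch evaluates them for a hypothetical setting with σ² = 1 and N = 20 (values chosen only for illustration) and prints the gap between the testing and training errors.

```python
import numpy as np

sigma2, N = 1.0, 20                 # hypothetical noise variance and training-set size
d = np.arange(1, N)                 # model orders with d < N, where the formulas apply

E_train = sigma2 * (1 - d / N)      # training error formula
E_test = sigma2 * (1 + d / N)       # testing error formula
gap = E_test - E_train              # equals 2 * sigma2 * d / N

for k in (1, 5, 10, 19):
    print(f"d={k:2d}  E_train={E_train[k-1]:.2f}  E_test={E_test[k-1]:.2f}  gap={gap[k-1]:.2f}")
```

The gap grows linearly in d for a fixed N, so each additional model parameter widens the distance between what the training error reports and what the testing error will be.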

If d > N, then the optimization

$$
\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\operatorname{argmin}} \;\; \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2
$$

will have many global minimizers (see Figure 7.10), implying that the training error can go to zero. Our analysis of $\mathcal{E}_{\text{train}}$ and $\mathcal{E}_{\text{test}}$ does not cover this case because our proofs require $(\mathbf{X}^T\mathbf{X})^{-1}$ to exist. However, we can still extrapolate what will happen. When the training error is zero, it only means that we fit the training data perfectly. Since the testing error grows as d grows (though not in the particular form shown in Equation (7.22)), we should expect the testing error to become worse.


Learning curve 


The results we derived above can be summarized in the learning curve shown in Figure 7.14. In this figure we consider a simple problem where

$$
y_n = \theta_0 + \theta_1 x_n + e_n,
$$

for $e_n \sim \text{Gaussian}(0, 1)$. Therefore, according to our theoretical derivations, we have σ = 1 and d = 2. For every N, we compute the average training error $\mathcal{E}_{\text{train}}$ and the average testing error $\mathcal{E}_{\text{test}}$, and then mark them on the figure. These are our empirical training and testing errors. On the same figure, we plot the theoretical training and testing errors according to Equation (7.22).

The MATLAB and Python codes used to generate this learning curve are shown below. 


% MATLAB code to generate the learning curve
Nset = round(logspace(1,3,20));
E_train = zeros(1,length(Nset));
E_test  = zeros(1,length(Nset));
a = [1.3, 2.5];
for j = 1:length(Nset)
    N = Nset(j);
    x = linspace(-1,1,N)';
    E_train_temp = zeros(1,1000);
    E_test_temp  = zeros(1,1000);
    X = [ones(N,1), x(:)];
    for i = 1:1000
        y  = a(1) + a(2)*x + randn(size(x));   % training outputs
        y1 = a(1) + a(2)*x + randn(size(x));   % testing outputs (independent noise)
        theta = X\y(:);
        yhat = theta(1) + theta(2)*x;
        E_train_temp(i) = mean((yhat(:)-y(:)).^2);
        E_test_temp(i)  = mean((yhat(:)-y1(:)).^2);
    end
    E_train(j) = mean(E_train_temp);
    E_test(j)  = mean(E_test_temp);
end
semilogx(Nset, E_train, 'kx', 'LineWidth', 2, 'MarkerSize', 16); hold on;
semilogx(Nset, E_test, 'ro', 'LineWidth', 2, 'MarkerSize', 8);
semilogx(Nset, 1-2./Nset, 'k', 'LineWidth', 4);
semilogx(Nset, 1+2./Nset, 'r', 'LineWidth', 4);


# Python code to generate the learning curve
import numpy as np
import matplotlib.pyplot as plt

Nset = np.logspace(1,3,20)
Nset = Nset.astype(int)
E_train = np.zeros(len(Nset))
E_test  = np.zeros(len(Nset))
for j in range(len(Nset)):
    N = Nset[j]
    x = np.linspace(-1,1,N)
    a = np.array([1, 2])
    E_train_tmp = np.zeros(1000)
    E_test_tmp  = np.zeros(1000)
    for i in range(1000):
        y = a[0] + a[1]*x + np.random.randn(N)     # training observations
        X = np.column_stack((np.ones(N), x))
        theta = np.linalg.lstsq(X, y, rcond=None)[0]
        yhat = theta[0] + theta[1]*x
        E_train_tmp[i] = np.mean((yhat-y)**2)
        y1 = a[0] + a[1]*x + np.random.randn(N)    # testing observations (fresh noise)
        E_test_tmp[i] = np.mean((yhat-y1)**2)
    E_train[j] = np.mean(E_train_tmp)
    E_test[j]  = np.mean(E_test_tmp)
plt.semilogx(Nset, E_train, 'kx')
plt.semilogx(Nset, E_test, 'ro')
plt.semilogx(Nset, (1-2/Nset), linewidth=4, c='k')
plt.semilogx(Nset, (1+2/Nset), linewidth=4, c='r')
plt.show()


The training error curve and the testing error curve behave in opposite ways as $N$
increases. The training error $\mathcal{E}_{\text{train}}$ increases as $N$ increases, because when we have more
training samples it becomes harder for the model to fit all the data. By contrast, the testing
error $\mathcal{E}_{\text{test}}$ decreases as $N$ increases, because when we have more training samples the model
becomes more robust to noise and unseen data. Therefore, the testing error improves.

As N goes to infinity, both the training error and the testing error converge. This is 
due to the law of large numbers, which says that the empirical training and testing errors 
should converge to their respective expected values. If the training error and the testing error 
converge to the same value, the training can generalize to testing. If they do not converge to 
the same value, there is a mismatch between the training samples and the testing samples. 

It is important to pay attention to the gap between the converged values. We often
assume that the training samples and the testing samples are drawn from the same
distribution, and therefore the training samples are good representatives of the testing samples.
If the assumption is not true, there will be a gap between the converged training error and
the testing error. Thus, what you claim in training cannot be transferred to the testing.
Consequently, the learning curve provides you with a useful debugging tool to check how
well your training compares with your testing.


[Plot: training error (crosses) and testing error (circles) versus the number of training samples, N, on a logarithmic axis.]

Figure 7.14: The learning curve is a pair of functions representing the training error and the testing
error. As N increases we expect the training error to increase and the testing error to decrease. The two
functions will converge to the same value as N goes to infinity. If they do not converge to the same
value, there is an intrinsic mismatch between the training samples and the testing samples, e.g., the
training samples are not representative enough for the dataset.


Closing remark. In this section we have studied a very important concept in regression, 
overfitting. We emphasize that overfitting is not only caused by the complexity of the model
but by a combination of the three factors $\sigma^2$, $N$, and $d$. We close this section by summarizing
the causes of overfitting:


What is the source of overfitting?


• Overfitting occurs because you have an imbalance between $\sigma^2$, $N$ and $d$.


• Selecting the correct complexity for your model is the key to avoiding overfitting.


7.3 Bias and Variance Trade-Off


Our linear analysis has provided you with a rough understanding of what we experience in 
overfitting. However, for general regression problems where the models are not necessarily 
linear, we need to go deeper. The goal of this section is to explain the trade-off between 
bias and variance. This analysis requires some patience as it involves many equations. We 
recommend skipping this section on a first reading and then returning to it later.
If it is your first time reading it, we recommend you go through it slowly.


7.3.1 Decomposing the testing error 


Notations 


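As we did at the beginning of Section 7.2, we consider a ground truth model that relates
an input $\mathbf{x}$ and an output $y$:

$$y = f(\mathbf{x}) + e,$$

where $e \sim \text{Gaussian}(0, \sigma^2)$ is the noise. For example, if we use a linear model, then $f$ could
be $f(\mathbf{x}) = \boldsymbol{\theta}^T\mathbf{x}$, for some regression coefficients $\boldsymbol{\theta}$.

During training, we pick a prediction model $g_{\boldsymbol{\theta}}(\cdot)$ and try to predict the output when
given a training sample $\mathbf{x}$:

$$\widehat{y} = g_{\boldsymbol{\theta}}(\mathbf{x}).$$

For example, we may choose $g_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^T\mathbf{x}$, which is also a linear model. We may also choose
a linear model in another basis, e.g., $g_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^T\boldsymbol{\phi}(\mathbf{x})$ for some transformations $\boldsymbol{\phi}(\cdot)$. In any
case, the goal of training is to minimize the training error:

$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmin}} \; \frac{1}{N}\sum_{n=1}^{N} \left(g_{\boldsymbol{\theta}}(\mathbf{x}_n) - y_n\right)^2,$$

where the sum is taken over the training samples $\mathcal{D}_{\text{train}} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$. Because
the model parameter $\widehat{\boldsymbol{\theta}}$ is learned from the training dataset $\mathcal{D}_{\text{train}}$, the prediction model
depends on $\mathcal{D}_{\text{train}}$. To emphasize this dependency, we write

$$g^{(\mathcal{D}_{\text{train}})} = \text{the model trained from } \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}.$$

During testing, we consider a testing dataset $\mathcal{D}_{\text{test}} = \{(\mathbf{x}'_1, y'_1), \ldots, (\mathbf{x}'_M, y'_M)\}$. We put
these testing samples into the trained model to predict an output:

$$\widehat{y}'_m = g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}'_m), \qquad m = 1, \ldots, M. \qquad \text{(predicted value)}$$

Since the goal of regression is to make $g^{(\mathcal{D}_{\text{train}})}$ as close to $f$ as possible, it is natural to
expect $\widehat{y}'_m$ to be close to $y'_m$.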


Testing error decomposition (noise-free) 


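So we can now compute the testing error, which is the error that we ultimately care about. In the
noise-free condition, i.e., $e = 0$, the testing error is defined as

$$\mathcal{E}_{\text{test}}^{(\mathcal{D}_{\text{train}})} = \mathbb{E}_{\mathbf{x}'}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - f(\mathbf{x}')\right)^2\right] \approx \frac{1}{M}\sum_{m=1}^{M}\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}'_m) - f(\mathbf{x}'_m)\right)^2. \qquad (7.23)$$

There are several components in this equation. First, $\mathbf{x}'$ is a testing sample drawn from a
certain distribution. You can think of $\mathcal{D}_{\text{test}}$ as a finite subset drawn from this distribution.
Second, the error $(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - f(\mathbf{x}'))^2$ measures the deviation between our predicted value
and the true value. Note that this error term is specific to one testing sample $\mathbf{x}'$. Therefore,
we take the expectation $\mathbb{E}_{\mathbf{x}'}$ to find the average of the error for the distribution of $\mathbf{x}'$.

The testing error $\mathcal{E}_{\text{test}}^{(\mathcal{D}_{\text{train}})}$ is a function that is dependent on the training set $\mathcal{D}_{\text{train}}$,
because the model $g^{(\mathcal{D}_{\text{train}})}$ is trained from $\mathcal{D}_{\text{train}}$. Therefore, as we change the training
set, we will have a different model $g$ and hence a different testing error. To eliminate the
randomness of the training set, we define the overall testing error as

$$\mathcal{E}_{\text{test}} = \mathbb{E}_{\mathcal{D}_{\text{train}}}\left[\mathcal{E}_{\text{test}}^{(\mathcal{D}_{\text{train}})}\right] = \mathbb{E}_{\mathcal{D}_{\text{train}}}\left[\mathbb{E}_{\mathbf{x}'}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - f(\mathbf{x}')\right)^2\right]\right]. \qquad (7.24)$$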


Note that this definition of the testing error is consistent with the special case in Equa- 
tion (7.18), in which the testing error involves a joint expectation over e and e’. The ex- 
pectation over e accounts for the training samples, and the expectation over e’ accounts for 
the testing samples. 

Let us try to extract some meaning from the testing error. Our method will be to 
decompose the testing error into bias and variance. 


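Theorem 7.5. Assume a noise-free condition. The testing error of a regression problem
is given by

$$\mathcal{E}_{\text{test}} = \mathbb{E}_{\mathbf{x}'}\Big[\underbrace{\left(\bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)^2}_{=\text{bias}(\mathbf{x}')} + \underbrace{\mathbb{E}_{\mathcal{D}_{\text{train}}}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)^2\right]}_{=\text{var}(\mathbf{x}')}\Big], \qquad (7.25)$$

where $\bar{g}(\mathbf{x}') \overset{\text{def}}{=} \mathbb{E}_{\mathcal{D}_{\text{train}}}\left[g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}')\right]$.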


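Proof. To simplify our notation, we will drop the subscript "train" in $\mathcal{D}_{\text{train}}$ when the
context is clear. We have that

$$\mathcal{E}_{\text{test}} = \mathbb{E}_{\mathcal{D}}\left[\mathbb{E}_{\mathbf{x}'}\left[\left(g^{(\mathcal{D})}(\mathbf{x}') - f(\mathbf{x}')\right)^2\right]\right] = \mathbb{E}_{\mathbf{x}'}\left[\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}') - f(\mathbf{x}')\right)^2\right]\right].$$

Continuing the calculation,

$$\mathcal{E}_{\text{test}} = \mathbb{E}_{\mathbf{x}'}\left[\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}') - \bar{g}(\mathbf{x}') + \bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)^2\right]\right]$$
$$= \mathbb{E}_{\mathbf{x}'}\Big[\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)^2\right] + 2\,\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)\left(\bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)\right] + \mathbb{E}_{\mathcal{D}}\left[\left(\bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)^2\right]\Big].$$

Since $\bar{g}(\mathbf{x}') = \mathbb{E}_{\mathcal{D}}[g^{(\mathcal{D})}(\mathbf{x}')]$, we have $\mathbb{E}_{\mathcal{D}}[g^{(\mathcal{D})}(\mathbf{x}') - \bar{g}(\mathbf{x}')] = 0$, and it follows that

$$2\,\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)\left(\bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)\right] = 0$$

because $\bar{g}(\mathbf{x}') - f(\mathbf{x}')$ is independent of $\mathcal{D}$, and for the same reason

$$\mathbb{E}_{\mathcal{D}}\left[\left(\bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)^2\right] = \left(\bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)^2.$$

Therefore,

$$\mathcal{E}_{\text{test}} = \mathbb{E}_{\mathbf{x}'}\Big[\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)^2\right] + \left(\bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)^2\Big].$$

Thus, by defining the first term as var($\mathbf{x}'$) and the second term as bias($\mathbf{x}'$), we have proved
the theorem.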


Let’s consider what this theorem implies. This result is a decomposition of the testing 
error into bias and variance. It is a universal result that applies to all regression models, 
not only linear cases. To summarize the meanings of bias and variance: 


What are bias and variance? 


• Bias = how far your average is from the truth.


• Variance = how much fluctuation you have around the average.
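The decomposition can be checked numerically. The sketch below is only an illustration under assumed settings: the ground truth $f(x) = \sin(\pi x)$, the straight-line model, the $N = 5$ random training inputs, and the $K = 2000$ repetitions are all choices made here, not taken from the text. The training sets differ because the inputs are redrawn each time (there is no noise).

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)          # assumed ground truth (illustration only)
N, K = 5, 2000                           # training-set size, number of training sets
x_test = np.linspace(-1, 1, 200)         # grid standing in for the distribution of x'

preds = np.zeros((K, x_test.size))       # preds[k] = g^(D_k) evaluated on x_test
for k in range(K):
    x = rng.uniform(-1, 1, N)            # a fresh noise-free training set
    X = np.column_stack((np.ones(N), x))
    theta = np.linalg.lstsq(X, f(x), rcond=None)[0]
    preds[k] = theta[0] + theta[1] * x_test

g_bar = preds.mean(axis=0)                       # average predictor
bias = np.mean((g_bar - f(x_test)) ** 2)         # E_x'[(g_bar - f)^2]
var = np.mean((preds - g_bar) ** 2)              # E_x'[E_D[(g - g_bar)^2]]
test_err = np.mean((preds - f(x_test)) ** 2)     # E_x'[E_D[(g - f)^2]]

print(test_err, bias + var)                      # the two numbers agree
```

With empirical averages in place of expectations, the identity holds exactly (up to rounding), because the cross term vanishes for the same reason as in the proof.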


Figure 7.15 gives a pictorial representation of bias and variance. In this figure, we
construct four scenarios of bias and variance. Each cross represents the predictor $g^{(\mathcal{D}_{\text{train}})}$,
with the true predictor $f$ at the origin. Figure 7.15(a) shows the case with a low bias and
a low variance. All these predictors $g^{(\mathcal{D}_{\text{train}})}$ are very close to the ground truth, and they
have small fluctuations around their average. Figure 7.15(b) shows the case of a high bias
and a low variance. It has a high bias because the entire group of $g^{(\mathcal{D}_{\text{train}})}$ is shifted to the
corner. The bias, which is the distance from the truth to the average, is therefore large. The
variance remains small because the fluctuation around the average is small. Figure 7.15(c)
shows the case of a low bias but high variance. In this case, the fluctuation around the
average is large. Figure 7.15(d) shows the case of high bias and high variance. We want to
avoid this case.


Figure 7.15: Imagine that you are throwing a dart with a target at the center. The four subfigures show
the levels of bias and variance: (a) bias low, variance low; (b) bias high, variance low; (c) bias low,
variance high; (d) bias high, variance high.


Testing error decomposition (noisy case)


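Let us consider a situation when there is noise. In the presence of noise, the training and
testing samples will follow the relationship

$$y = f(\mathbf{x}) + e,$$

where $e \sim \text{Gaussian}(0, \sigma^2)$. We assume that the noise is Gaussian to make the proof easier.
We can consider other types of noise in theory, but the theoretical results will need to be
modified.

In the presence of noise, the testing error is

$$\mathcal{E}_{\text{test}}(\mathbf{x}') \overset{\text{def}}{=} \mathbb{E}_{\mathcal{D}_{\text{train}}, e}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \left(f(\mathbf{x}') + e\right)\right)^2\right]$$
$$= \mathbb{E}_{\mathcal{D}_{\text{train}}, e}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \bar{g}(\mathbf{x}') + \bar{g}(\mathbf{x}') - f(\mathbf{x}') - e\right)^2\right],$$

where we take the joint expectation over the training dataset $\mathcal{D}_{\text{train}}$ and the error $e$.
Continuing the calculation, and using the fact that $\mathcal{D}_{\text{train}}$ and $e$ are independent (and $\mathbb{E}[e] = 0$),
it follows that

$$\mathcal{E}_{\text{test}}(\mathbf{x}') = \mathbb{E}_{\mathcal{D}_{\text{train}}, e}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \bar{g}(\mathbf{x}') + \bar{g}(\mathbf{x}') - f(\mathbf{x}') - e\right)^2\right]$$
$$= \underbrace{\mathbb{E}_{\mathcal{D}_{\text{train}}}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)^2\right]}_{=\text{var}(\mathbf{x}')} + \underbrace{\left(\bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)^2}_{=\text{bias}(\mathbf{x}')} + \underbrace{\mathbb{E}_{e}[e^2]}_{=\text{noise}}.$$

Taking the expectation of $\mathbf{x}'$ over the entire testing distribution gives us

$$\mathcal{E}_{\text{test}} = \mathbb{E}_{\mathbf{x}'}\left[\mathcal{E}_{\text{test}}(\mathbf{x}')\right] = \underbrace{\mathbb{E}_{\mathbf{x}'}[\text{var}(\mathbf{x}')]}_{\text{var}} + \underbrace{\mathbb{E}_{\mathbf{x}'}[\text{bias}(\mathbf{x}')]}_{\text{bias}} + \sigma^2.$$

The theorem below summarizes the results: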
The theorem below summarizes the results: 


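Theorem 7.6. Assume a noisy condition where $y = f(\mathbf{x}) + e$ for some i.i.d. Gaussian
noise $e \sim \text{Gaussian}(0, \sigma^2)$. The testing error of a regression problem is given by

$$\mathcal{E}_{\text{test}} = \mathbb{E}_{\mathbf{x}'}\Big[\underbrace{\left(\bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)^2}_{=\text{bias}(\mathbf{x}')}\Big] + \mathbb{E}_{\mathbf{x}'}\Big[\underbrace{\mathbb{E}_{\mathcal{D}_{\text{train}}}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)^2\right]}_{=\text{var}(\mathbf{x}')}\Big] + \sigma^2, \qquad (7.26)$$

where $\bar{g}(\mathbf{x}') \overset{\text{def}}{=} \mathbb{E}_{\mathcal{D}_{\text{train}}}\left[g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}')\right]$.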
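For readers who want to see why only three terms survive in the noisy case, the expansion can be written out term by term. The shorthand $a$ and $b$ is introduced here only for this sketch; the sign in front of $e$ does not affect the result because $e$ has zero mean.

```latex
\begin{align*}
a &\overset{\text{def}}{=} g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \bar{g}(\mathbf{x}'),
\qquad
b \overset{\text{def}}{=} \bar{g}(\mathbf{x}') - f(\mathbf{x}'), \\
\mathbb{E}_{\mathcal{D}_{\text{train}},\, e}\big[(a + b - e)^2\big]
&= \mathbb{E}[a^2] + b^2 + \mathbb{E}[e^2]
   + 2b\,\mathbb{E}[a] - 2\,\mathbb{E}[a]\,\mathbb{E}[e] - 2b\,\mathbb{E}[e] \\
&= \mathbb{E}_{\mathcal{D}_{\text{train}}}[a^2] + b^2 + \sigma^2 .
\end{align*}
```

Here $b$ is deterministic for a fixed $\mathbf{x}'$, $\mathbb{E}[a] = 0$ by the definition of $\bar{g}$, $\mathbb{E}[e] = 0$, and $\mathbb{E}[ae] = \mathbb{E}[a]\,\mathbb{E}[e]$ because the testing noise is independent of the training set. The three remaining terms are var($\mathbf{x}'$), bias($\mathbf{x}'$), and the noise.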


7.3.2 Analysis of the bias 


Let us examine the bias and variance in more detail. To discuss bias we must first understand
the quantity

$$\bar{g}(\mathbf{x}') \overset{\text{def}}{=} \mathbb{E}_{\mathcal{D}_{\text{train}}}\left[g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}')\right], \qquad (7.27)$$

which is known as the average predictor. The average predictor, as the equation suggests, is
the expectation of the predictor $g^{(\mathcal{D}_{\text{train}})}$. Remember that $g^{(\mathcal{D}_{\text{train}})}$ is a predictor constructed
from a specific training set $\mathcal{D}_{\text{train}}$. If tomorrow our training set $\mathcal{D}_{\text{train}}$ contains other data
(that come from the same underlying distribution), $g^{(\mathcal{D}_{\text{train}})}$ will be different. The average
predictor $\bar{g}$ is the average across these random fluctuations of the dataset $\mathcal{D}_{\text{train}}$. Here is an
example:

Suppose we use a linear model with the ordinary polynomials as the bases. The data
points are generated according to

$$y_n = \underbrace{\sum_{p=0}^{d-1} \theta_p x_n^p}_{=f(x_n) = \boldsymbol{\theta}^T\mathbf{x}_n} + e_n. \qquad (7.28)$$

If we use a particular training set $\mathcal{D}_{\text{train}}$ and run the regression, we will be able to obtain
one of the regression lines, as shown in Figure 7.16. Let us call this line $g^{(1)}$. We repeat the
experiment by drawing another dataset, and call it $g^{(2)}$. We continue and eventually we will
find a set of regression lines $g^{(1)}, g^{(2)}, \ldots, g^{(K)}$, where $K$ denotes the number of training sets
you are using to generate all the gray curves. The average predictor $\bar{g}$ is defined as

$$\bar{g}(\mathbf{x}') = \mathbb{E}_{\mathcal{D}_{\text{train}}}\left[g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}')\right] \approx \frac{1}{K}\sum_{k=1}^{K} g^{(k)}(\mathbf{x}').$$

Thus if we take the average of all these gray curves we will obtain the average predictor,
which is the red curve shown in Figure 7.16.




Figure 7.16: We run linear regression many times for different training datasets. Each one consists of 
different random realizations of noise. The gray curves are the regression lines returned by each of the 
training datasets. We then take the average of these gray curves to obtain the red curve, which is the 
average predictor. 


If you are curious about how this plot was generated, the MATLAB and Python codes 
are given below. 


% MATLAB code to visualize the average predictor
N = 20;
a = [5.7, 3.7, -3.6, -2.3, 0.05];
x = linspace(-1,1,N);
yhat = zeros(100,50);
for i=1:100
    X = [x(:).^0, x(:).^1, x(:).^2, x(:).^3, x(:).^4];
    y = X*a(:) + 0.5*randn(N,1);
    theta = X\y(:);
    t = linspace(-1, 1, 50);
    yhat(i,:) = theta(1) + theta(2)*t(:) + theta(3)*t(:).^2 ...
              + theta(4)*t(:).^3 + theta(5)*t(:).^4;
end
figure;
plot(t, yhat, 'color', [0.6 0.6 0.6]); hold on;
plot(t, mean(yhat), 'LineWidth', 4, 'color', [0.8 0 0]);
axis([-1 1 -2 2]);


# Python code to visualize the average predictor
import numpy as np
import matplotlib.pyplot as plt

N = 20
x = np.linspace(-1,1,N)
a = np.array([0.5, -2, -3, 4, 6])
yhat = np.zeros((50,100))
for i in range(100):
    y = a[0] + a[1]*x + a[2]*x**2 + \
        a[3]*x**3 + a[4]*x**4 + 0.5*np.random.randn(N)
    X = np.column_stack((np.ones(N), x, x**2, x**3, x**4))
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    t = np.linspace(-1,1,50)
    Xhat = np.column_stack((np.ones(50), t, t**2, t**3, t**4))
    yhat[:,i] = np.dot(Xhat, theta)
    plt.plot(t, yhat[:,i], c='gray')                      # one regression curve per training set
plt.plot(t, np.mean(yhat, axis=1), c='r', linewidth=4)    # the average predictor
plt.show()


We now show an analytic calculation to verify Figure 7.16. 
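Example 7.4. Consider a linear model such that

$$y = \mathbf{x}^T\boldsymbol{\theta} + e. \qquad (7.29)$$

What is the predictor $g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}')$? What is the average predictor $\bar{g}(\mathbf{x}')$?


Solution. First, consider a training dataset $\mathcal{D}_{\text{train}} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$. We
assume that the $\mathbf{x}_n$'s are deterministic and fixed. Therefore, the source of randomness
in the training set is caused by the noise $\mathbf{e} \sim \text{Gaussian}(0, \sigma^2\mathbf{I})$ and hence by the noisy
observation $\mathbf{y}$.
The training set gives us the equation $\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \mathbf{e}$, where $\mathbf{X}$ is the matrix
constructed from the $\mathbf{x}_n$'s. The regression solution to this dataset is

$$\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},$$

which should actually be $\widehat{\boldsymbol{\theta}}^{(\mathcal{D}_{\text{train}})}$ because $\mathbf{y}$ is a dataset-dependent vector.
Consequently,

$$g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') = \widehat{\boldsymbol{\theta}}^T\mathbf{x}' = (\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
$$= (\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta} + \mathbf{e})$$
$$= (\mathbf{x}')^T\boldsymbol{\theta} + (\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e}.$$

Since the randomness of $\mathcal{D}_{\text{train}}$ is caused by the noise, it follows that

$$\bar{g}(\mathbf{x}') = \mathbb{E}_{\mathcal{D}_{\text{train}}}\left[g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}')\right] = \mathbb{E}_{\mathbf{e}}\left[(\mathbf{x}')^T\boldsymbol{\theta} + (\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e}\right]$$
$$= (\mathbf{x}')^T\boldsymbol{\theta} + (\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\,\mathbb{E}_{\mathbf{e}}[\mathbf{e}]$$
$$= (\mathbf{x}')^T\boldsymbol{\theta} + 0 = f(\mathbf{x}').$$

So the average predictor will return the ground truth. However, note that not all
predictors will return the ground truth.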
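The conclusion of this example can be checked with a short simulation. In the sketch below the design matrix, the coefficients `theta`, the testing sample `x_new`, the noise level, and the number of repetitions `K` are hypothetical values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 20, 3, 0.5
X = rng.standard_normal((N, d))        # inputs, held fixed across training sets
theta = np.array([1.0, -2.0, 0.5])     # hypothetical ground-truth coefficients
x_new = np.array([0.3, -0.7, 1.2])     # a testing sample x'

H = np.linalg.solve(X.T @ X, X.T)      # (X^T X)^{-1} X^T, shape (d, N)

# Without noise the regression recovers theta exactly.
noise_free_ok = np.allclose(H @ (X @ theta), theta)

# K training sets that differ only in their noise realization.
K = 20000
Y = X @ theta + sigma * rng.standard_normal((K, N))   # one row per training set
theta_hats = Y @ H.T                                  # one estimate per row, shape (K, d)
g = theta_hats @ x_new                                # g^(D_k)(x') for every k

g_bar = g.mean()                                      # empirical average predictor
f_true = x_new @ theta                                # f(x') = (x')^T theta
print(g_bar, f_true)        # close to each other
print(g.min(), g.max())     # individual predictors still scatter around f(x')
```

The average over training sets approaches $f(\mathbf{x}')$, while the individual predictors do not, which is the distinction drawn at the end of the example.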


In the above example, we obtained an interesting result, namely that $\bar{g}(\mathbf{x}') = f(\mathbf{x}')$.
That is, the average predictor equals the true predictor. However, in general, $\bar{g}(\mathbf{x}')$ does
not necessarily equal $f(\mathbf{x}')$. If this occurs, we have a deviation $(\bar{g}(\mathbf{x}') - f(\mathbf{x}'))^2 > 0$. This
deviation is called the bias. Bias is independent of the number of training samples because
we have taken the average of the predictors. Therefore, bias is more of an intrinsic (or
systematic) error due to the choice of the model.


What is bias?


• Bias is defined as bias $= \mathbb{E}_{\mathbf{x}'}[(\bar{g}(\mathbf{x}') - f(\mathbf{x}'))^2]$, where $\mathbf{x}'$ is a testing sample.


• It is the deviation from the average predictor to the true predictor.


• Bias is not necessarily a bad thing. A good predictor can have some bias as long
as it helps to reduce the variance.


7.3.3 Variance 


The other quantity in the game is the variance. Variance at a testing sample $\mathbf{x}'$ is defined
as

$$\text{var}(\mathbf{x}') \overset{\text{def}}{=} \mathbb{E}_{\mathcal{D}_{\text{train}}}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)^2\right]. \qquad (7.30)$$

As the equation suggests, the variance measures the fluctuation between the predictor
$g^{(\mathcal{D}_{\text{train}})}$ and the average predictor $\bar{g}$. Figure 7.17 illustrates the polynomial-fitting
problem we discussed above. In this figure we consider two levels of variance by varying the
noise strength of $e_n$. The figure shows that as the observation becomes noisier, the predictor
$g^{(\mathcal{D}_{\text{train}})}$ will have a larger fluctuation around the average predictor.


Figure 7.17: Variance measures the magnitude of fluctuation between the particular predictor $g^{(\mathcal{D}_{\text{train}})}$
and the average predictor $\bar{g}$. (a) Small variance. (b) Large variance.

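Example 7.5. Continuing with Example 7.4, we ask: What is the variance?


Solution. We first determine the estimated parameter and its average:

$$\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \boldsymbol{\theta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e},$$
$$\mathbb{E}_{\mathbf{e}}[\widehat{\boldsymbol{\theta}}] = \mathbb{E}_{\mathbf{e}}\left[\boldsymbol{\theta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e}\right] = \boldsymbol{\theta},$$

so the prediction at a testing sample $\mathbf{x}'$ is

$$g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') = (\mathbf{x}')^T\boldsymbol{\theta} + (\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e},$$
$$\bar{g}(\mathbf{x}') = (\mathbf{x}')^T\boldsymbol{\theta}.$$

Consequently, the variance is

$$\mathbb{E}_{\mathcal{D}_{\text{train}}}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)^2\right] = \mathbb{E}_{\mathbf{e}}\left[\left((\mathbf{x}')^T\boldsymbol{\theta} + (\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e} - (\mathbf{x}')^T\boldsymbol{\theta}\right)^2\right]$$
$$= \mathbb{E}_{\mathbf{e}}\left[\left((\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e}\right)^2\right].$$

Continuing the calculation,

$$\mathbb{E}_{\mathcal{D}_{\text{train}}}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)^2\right] = (\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\,\mathbb{E}_{\mathbf{e}}[\mathbf{e}\mathbf{e}^T]\,\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}'$$
$$= (\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\,\sigma^2\mathbf{I}\,\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}'$$
$$= \sigma^2(\mathbf{x}')^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}'$$
$$= \sigma^2\,\text{Tr}\left\{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}'(\mathbf{x}')^T\right\}.$$

What will happen if we use more samples so that $N$ grows? As $N$ grows, the matrix $\mathbf{X}$ will
have more rows. Assuming that the magnitude of the entries remains unchanged, more rows
in $\mathbf{X}$ will increase the magnitude of $\mathbf{X}^T\mathbf{X}$ because we are summing more terms. Consider
a $2 \times 2$ ordinary polynomial system where

$$\mathbf{X}^T\mathbf{X} = \begin{bmatrix} \sum_{n=1}^{N} x_n^2 & \sum_{n=1}^{N} x_n \\ \sum_{n=1}^{N} x_n & N \end{bmatrix}.$$

As $N$ grows, the entries in the matrix grow (the diagonal entries in particular grow with every
added sample). As a result, $(\mathbf{X}^T\mathbf{X})^{-1}$ will shrink in magnitude and thus drive the variance
$\sigma^2\,\text{Tr}\left\{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}'(\mathbf{x}')^T\right\}$ to zero.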
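The closed-form variance can be evaluated directly to watch it shrink with $N$. The sketch below assumes a straight-line model with inputs on an even grid in $[-1, 1]$; the function name `predictor_variance`, the noise level, and the testing point are hypothetical choices for illustration. A Monte Carlo estimate at $N = 10$ is included as a sanity check of the formula.

```python
import numpy as np

def predictor_variance(N, sigma=0.5, x_new=0.5):
    """sigma^2 * x'^T (X^T X)^{-1} x' for the model theta_0 + theta_1 * x."""
    x = np.linspace(-1, 1, N)
    X = np.column_stack((np.ones(N), x))
    v = np.array([1.0, x_new])                    # testing sample in the same basis
    return sigma**2 * v @ np.linalg.solve(X.T @ X, v)

for N in (10, 100, 1000):
    print(N, predictor_variance(N))               # decreases as N grows

# Monte Carlo check of the formula at N = 10.
rng = np.random.default_rng(2)
N, sigma, x_new = 10, 0.5, 0.5
x = np.linspace(-1, 1, N)
X = np.column_stack((np.ones(N), x))
theta = np.array([1.0, 2.0])                      # hypothetical true coefficients
H = np.linalg.solve(X.T @ X, X.T)                 # (X^T X)^{-1} X^T
Y = X @ theta + sigma * rng.standard_normal((20000, N))
g = (Y @ H.T) @ np.array([1.0, x_new])            # predictions at x' across training sets
mc_var = g.var()
print(mc_var, predictor_variance(10))             # the two values should be close
```

Note that the variance depends on the noise level and on the inputs, but not on the true coefficients, exactly as the formula suggests.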


What is variance? 


• Variance is the deviation between the predictor $g^{(\mathcal{D}_{\text{train}})}$ and its average $\bar{g}$.


• It can be reduced by using more training samples.


7.3.4 Bias and variance on the learning curve 


The decomposition of the testing error into bias and variance is portrayed visually by the 
learning curve shown in Figure 7.18. This figure shows the testing error and the training 
error as functions of the number of training samples. As N increases, we observe that both 
testing and training errors converge to the same value. At any fixed N, the testing error is 
composed of bias and variance: 


• The bias is the distance from the ground to the steady-state level. This value is fixed
and is a constant w.r.t. $N$. In other words, regardless of how many training samples
you have, the bias is always there. It is the best outcome you can achieve.


• The variance is the fluctuation from the steady-state level to the instantaneous state.
It drops as $N$ increases.


[Plot: testing error versus the number of training samples, with the variance marked as the gap above the steady-state level.]

Figure 7.18: The learning curve can be decomposed into the sum of the bias and the variance. The bias
is the testing error when $N = \infty$. For finite $N$, the difference between the testing error and the bias is
the variance.

Figure 7.19 compares the learning curves of two models. The first case requires us to
fit the data using a simple model (marked in purple). The training error and the testing
error have small fluctuations around the steady-state because, for simple models, you need
only a small number of samples to make the model happy. The second case requires us to
fit the data using a complex model (marked in green). This set of curves has a much wider
fluctuation because it is harder to train and harder to generalize. However, when we have
enough training samples, the training error and the testing error will converge to a lower
steady-state value. Therefore, you need to pay the price of using a complex model, but if
you do, you will enjoy a lower testing error.


[Plot: training and testing errors of a simple model and a complex model versus the number of training samples, N, on a logarithmic axis.]


Figure 7.19: The generalization capability of a model is summarized by the training and testing errors 
of the model. If we use a simple model we will have an easier time with the training but the steady-state 
testing error will be high. In contrast, if we use a complex model we need to have a sufficient number 
of training samples to train the model well. However, when the complex model is well trained, the 
steady-state error will be lower. 


The implication of all this is that you should choose the model by considering the 
number of data points. Never buy an expensive toy when you do not have the money! If 
you insist on using a complex model while you do not have enough training data, you will 
suffer from a poor testing error even if you feel good about it. 


Closing remark. We close this section by revisiting the bias-variance trade-off: 


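$$\mathcal{E}_{\text{test}} = \mathbb{E}_{\mathbf{x}'}\Big[\underbrace{\left(\bar{g}(\mathbf{x}') - f(\mathbf{x}')\right)^2}_{=\text{bias}(\mathbf{x}')}\Big] + \mathbb{E}_{\mathbf{x}'}\Big[\underbrace{\mathbb{E}_{\mathcal{D}_{\text{train}}}\left[\left(g^{(\mathcal{D}_{\text{train}})}(\mathbf{x}') - \bar{g}(\mathbf{x}')\right)^2\right]}_{=\text{var}(\mathbf{x}')}\Big] + \sigma^2. \qquad (7.31)$$


The relationship among the three terms is summarized below:


What is the trade-off offered by the bias-variance analysis?


• Overfitting improves if $N \uparrow$: Variance drops as $N$ grows. Bias is unchanged.


• Overfitting worsens if $\sigma^2 \uparrow$: If training noise grows, $g^{(\mathcal{D}_{\text{train}})}$ will have more
fluctuations, so variance will grow. If testing noise grows, $e^2$ grows.


• Overfitting worsens if the target function $f$ is too complicated to be approximated
by $g$.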


End of the section. Please join us again. 


7.4 Regularization 


Having discussed the source of the overfitting problem, we now discuss methods to allevi- 
ate overfitting. The method we focus on here is regularization. Regularization means that 
instead of seeking the model parameters by minimizing the training loss alone, we add a 
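penalty term to force the parameters to "behave better". As a preview of the technique, we
change the original training loss

$$\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \underbrace{\sum_{n=1}^{N}\left(y_n - \sum_{p=0}^{d-1}\theta_p\phi_p(x_n)\right)^2}_{\text{data fidelity}}, \qquad (7.32)$$

which consists of only the data fidelity term, to a modified training loss

$$\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \underbrace{\sum_{n=1}^{N}\left(y_n - \sum_{p=0}^{d-1}\theta_p\phi_p(x_n)\right)^2}_{F(\boldsymbol{\theta}),\ \text{data fidelity}} + \underbrace{\lambda \cdot \sum_{p=0}^{d-1}\theta_p^2}_{\lambda \cdot R(\boldsymbol{\theta}),\ \text{regularization}}. \qquad (7.33)$$

Putting this into the matrix form, we define the data fidelity term as

$$F(\boldsymbol{\theta}) = \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2. \qquad (7.34)$$

The newly added term $R(\boldsymbol{\theta})$ is called the regularization function or the penalty function.
It can take a variety of forms, e.g.,

• Ridge regression: $R(\boldsymbol{\theta}) = \sum_{p=0}^{d-1}\theta_p^2 = \|\boldsymbol{\theta}\|^2$.

• LASSO regression: $R(\boldsymbol{\theta}) = \sum_{p=0}^{d-1}|\theta_p| = \|\boldsymbol{\theta}\|_1$.

In this section we aim to understand the role of the regularization functions by studying
these two examples of $R(\boldsymbol{\theta})$.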


7.4.1 Ridge regularization 
To explain the meaning of Equation (7.33) we write it in terms of matrices and vectors: 


$$\underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\text{minimize}} \;\; \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2 + \lambda\|\boldsymbol{\theta}\|^2, \qquad (7.35)$$


where $\lambda$ is called the regularization parameter. It needs to be tuned by the user. We refer
to Equation (7.35) as the ridge regression.^5


^5 In signal processing and optimization, Equation (7.35) is called the Tikhonov regularization. We follow
the statistics community in calling it the ridge regression.

How can the regularization function help to mitigate the overfitting problem? First
let's find the solution to this problem.


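Practice Exercise 1. Prove that the solution to Equation (7.35) is

$$\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}.$$


Solution. Take the derivative with respect to $\boldsymbol{\theta}$.^a This yields

$$\nabla_{\boldsymbol{\theta}}\left\{\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2 + \lambda\|\boldsymbol{\theta}\|^2\right\} = 2\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta} - \mathbf{y}) + 2\lambda\boldsymbol{\theta} = 0.$$

Rearranging the terms gives

$$(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}.$$

Taking the inverse of the matrix on both sides yields the solution.


^a The solution here requires some basic matrix calculus. You may refer to the University of
Waterloo's Matrix Cookbook https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.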
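The result of this exercise is easy to verify numerically. The following sketch uses random data and an arbitrary $\lambda = 0.7$, both assumptions made only for the check: it confirms that the gradient vanishes at the closed-form solution and that nearby parameter vectors have a larger loss.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, lam = 30, 4, 0.7                  # sizes and lambda chosen only for illustration
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

# Closed-form ridge solution (X^T X + lam I)^{-1} X^T y, computed with a linear solve.
theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Gradient of ||X theta - y||^2 + lam ||theta||^2 at the solution.
grad = 2 * X.T @ (X @ theta - y) + 2 * lam * theta

def loss(t):
    return np.sum((X @ t - y) ** 2) + lam * np.sum(t ** 2)

# Loss at a few randomly perturbed parameter vectors.
perturbed = [loss(theta + 1e-2 * rng.standard_normal(d)) for _ in range(5)]

print(np.max(np.abs(grad)))             # zero up to rounding error
print(loss(theta), min(perturbed))      # every perturbed loss is larger
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical design choice; it computes the same vector.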


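Let us compare the ridge regression solution with the vanilla regression solution:

$$\widehat{\boldsymbol{\theta}}_{\text{vanilla}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},$$
$$\widehat{\boldsymbol{\theta}}_{\text{ridge}}(\lambda) = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}.$$

Clearly, the only difference is the presence of the parameter $\lambda$:

• If $\lambda \to 0$, then $\widehat{\boldsymbol{\theta}}_{\text{ridge}}(0) = \widehat{\boldsymbol{\theta}}_{\text{vanilla}}$. This is because

$$\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2 + \underbrace{\lambda\|\boldsymbol{\theta}\|^2}_{=0}.$$

Hence, when $\lambda \to 0$, the regression problem goes back to the vanilla version, and so
does the solution.


• If $\lambda \to \infty$, then $\widehat{\boldsymbol{\theta}}_{\text{ridge}}(\infty) = 0$. This happens because minimizing $\mathcal{E}_{\text{train}}(\boldsymbol{\theta})$ is
equivalent to minimizing

$$\frac{1}{\lambda}\mathcal{E}_{\text{train}}(\boldsymbol{\theta}) = \underbrace{\frac{1}{\lambda}\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2}_{=0} + \|\boldsymbol{\theta}\|^2.$$

Since we are now minimizing $\|\boldsymbol{\theta}\|^2$, the solution will be $\boldsymbol{\theta} = 0$ because zero is the
smallest value a squared function can achieve.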
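The two limits can be observed numerically. The sketch below uses random data (an assumption for illustration) and a helper named `ridge`, which is a hypothetical name for the closed-form solution; it prints the distance to the vanilla solution and the norm of the ridge solution over a range of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 30, 4
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

def ridge(lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta_vanilla = np.linalg.lstsq(X, y, rcond=None)[0]

lams = (1e-8, 1e-2, 1.0, 1e2, 1e8)
norms = [np.linalg.norm(ridge(lam)) for lam in lams]
for lam, nrm in zip(lams, norms):
    gap = np.linalg.norm(ridge(lam) - theta_vanilla)
    print(lam, gap, nrm)     # small lam: gap near 0; large lam: norm near 0
```

In between the two extremes, the norm of the ridge solution shrinks steadily as $\lambda$ increases.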

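For any $0 < \lambda < \infty$, the net effect of $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})$ is the constant $\lambda$ added to all the
eigenvalues of $\mathbf{X}^T\mathbf{X}$. By taking the eigendecomposition of $\mathbf{X}^T\mathbf{X}$,

$$[\mathbf{U}, \mathbf{S}] = \text{eig}(\mathbf{X}^T\mathbf{X}),$$

we have that

$$\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I} = \mathbf{U}\mathbf{S}\mathbf{U}^T + \lambda\mathbf{I} = \mathbf{U}\mathbf{S}\mathbf{U}^T + \lambda\mathbf{U}\mathbf{U}^T = \mathbf{U}(\mathbf{S} + \lambda\mathbf{I})\mathbf{U}^T.$$

Therefore, if the eigenvalue matrix $\mathbf{S}$ has a zero eigenvalue it will be offset by $\lambda$:

$$\mathbf{S} = \begin{bmatrix} s_1 & & & \\ & \ddots & & \\ & & s_{d-1} & \\ & & & 0 \end{bmatrix} \;\Longrightarrow\; \mathbf{S} + \lambda\mathbf{I} = \begin{bmatrix} s_1 + \lambda & & & \\ & \ddots & & \\ & & s_{d-1} + \lambda & \\ & & & \lambda \end{bmatrix}.$$

As a result, even if $\mathbf{X}^T\mathbf{X}$ is not invertible (or close to not invertible), the new matrix
$\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ is guaranteed to be invertible.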

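Practice Exercise 2. You may be wondering what happens if $\mathbf{X}^T\mathbf{X}$ has a negative
eigenvalue so that when we add a positive $\lambda$, the resulting matrix may have a zero
eigenvalue. Prove that $\mathbf{X}^T\mathbf{X}$ will never have a negative eigenvalue, and $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$
always has positive eigenvalues.


Solution. Eigenvalues of a symmetric matrix $\mathbf{A}$ are nonnegative if and only if $\mathbf{v}^T\mathbf{A}\mathbf{v} \geq 0$ for
any $\mathbf{v}$. Thus we need to check whether $\mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v} \geq 0$ for all $\mathbf{v}$. However, this is easy:

$$\mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v} = \|\mathbf{X}\mathbf{v}\|^2,$$

which must be nonnegative for any $\mathbf{v}$. Matrices satisfying this property are called
positive semidefinite. Therefore, $\mathbf{X}^T\mathbf{X}$ is positive semidefinite. For $\lambda > 0$, the same argument
gives $\mathbf{v}^T(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{v} = \|\mathbf{X}\mathbf{v}\|^2 + \lambda\|\mathbf{v}\|^2 > 0$ for any $\mathbf{v} \neq 0$, so all eigenvalues of
$\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ are positive.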


Implementation 


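Solving the ridge regression is easy. First, we observe that the regularization function $R(\boldsymbol{\theta}) =
\|\boldsymbol{\theta}\|^2$ is a quadratic function. Therefore, it can be combined with the data fidelity term as

$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\operatorname{argmin}} \; \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2 + \lambda\|\boldsymbol{\theta}\|^2$$
$$= \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\operatorname{argmin}} \; \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2 + \|\sqrt{\lambda}\,\mathbf{I}\boldsymbol{\theta} - \mathbf{0}\|^2$$
$$= \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\operatorname{argmin}} \; \left\|\begin{bmatrix}\mathbf{X} \\ \sqrt{\lambda}\,\mathbf{I}\end{bmatrix}\boldsymbol{\theta} - \begin{bmatrix}\mathbf{y} \\ \mathbf{0}\end{bmatrix}\right\|^2.$$

Therefore, all we need to do is to concatenate the matrix $\mathbf{X}$ with a $d \times d$ identity operator
$\sqrt{\lambda}\,\mathbf{I}$, and concatenate $\mathbf{y}$ with a $d \times 1$ all-zero vector.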


In MATLAB and Python, the implementation of the ridge regression is done by defining 
a new matrix A and a new vector b, as shown below: 


% MATLAB command for ridge regression
A = [X; sqrt(lambda)*eye(d)];
b = [y(:); zeros(d,1)];
theta = A\b;


# Python command for ridge regression
A = np.vstack((X, np.sqrt(lambd)*np.eye(d)))
b = np.hstack((y, np.zeros(d)))
theta = np.linalg.lstsq(A, b, rcond=None)[0]

Example 7.6. Consider a dataset of $N = 20$ data points. These data points are
constructed from the model

$$y_n = 0.5 - 2x_n - 3x_n^2 + 4x_n^3 + 6x_n^4 + e_n, \qquad n = 1, \ldots, N,$$

where $e_n \sim \text{Gaussian}(0, 0.25^2)$ is the noise. Fit the data using


(a) Vanilla linear regression with a 4th-order polynomial.

(b) Vanilla linear regression with a 20th-order polynomial.

(c) Ridge regression with a 20th-order polynomial, by considering three choices of $\lambda$:
$\lambda = 10^{-6}$, $\lambda = 10^{-3}$, and $\lambda = 10$.


Solution.


(a) We first fit the data using a 4th-order polynomial. This fitting is relatively
straightforward. In the MATLAB / Python programs below, set $d = 4$ and
$\lambda = 0$. The result is shown in Figure 7.20(a).


[Plots: data points and fitted curve for (a) Vanilla, 4th-order polynomial and (b) Vanilla, 20th-order polynomial.]

Figure 7.20: Overfitting occurs when the model is too complex for the number of training samples.
When using a vanilla regression with a 20th-order polynomial, the curve overfits the data and
causes a catastrophic fitting error.


(b) Suppose we use a 20th-order polynomial $g(x) = \sum_{p=0}^{20}\theta_p x^p$ to fit the data. We
plot the result in Figure 7.20(b). Since the order of the polynomial is very high
relative to the number of training samples, it comes as no surprise that the fitting
is poor. This is overfitting, and we know the reason.


(c) Next, we consider a ridge regression using three choices of $\lambda$. The result is shown
in Figure 7.21. If $\lambda$ is too small, we observe that some overfitting still occurs. If
$\lambda$ is too large, then the curve underfits the data. For an appropriately chosen $\lambda$,
it can be seen that the fitting is reasonably good.

[Plots: data points and fitted curve for (a) Ridge, $\lambda = 10^{-6}$; (b) Ridge, $\lambda = 10^{-3}$; (c) Ridge, $\lambda = 10$.]

Figure 7.21: Ridge regression addresses the overfitting problem by adding a regularization term
to the training loss. Depending on the strength of the parameter $\lambda$, the fitted curve can vary from
overfitting to underfitting.

The MATLAB and Python codes used to generate the above plots are shown below. 


% MATLAB code to demonstrate a ridge regression example
% Generate data
N = 20;
x = linspace(-1,1,N);
a = [0.5, -2, -3, 4, 6];
y = a(1)+a(2)*x(:)+a(3)*x(:).^2+a(4)*x(:).^3+a(5)*x(:).^4+0.25*randn(N,1);

% Ridge regression
lambda = 0.1;
d = 20;
X = zeros(N, d);
for p=0:d-1
    X(:,p+1) = x(:).^p;
end
A = [X; sqrt(lambda)*eye(d)];
b = [y(:); zeros(d,1)];
theta = A\b;

% Interpolate and display results
t = linspace(-1, 1, 500);
Xhat = zeros(length(t), d);
for p=0:d-1
    Xhat(:,p+1) = t(:).^p;
end
yhat = Xhat*theta;
plot(x,y,'ko','LineWidth',2,'MarkerSize',10); hold on;
plot(t,yhat,'LineWidth',4,'Color',[0.2 0.2 0.9]);


# Python code to demonstrate a ridge regression example
import numpy as np
import matplotlib.pyplot as plt

# Generate data
N = 20
x = np.linspace(-1,1,N)
a = np.array([0.5, -2, -3, 4, 6])
y = a[0] + a[1]*x + a[2]*x**2 + \
    a[3]*x**3 + a[4]*x**4 + 0.25*np.random.randn(N)

# Ridge regression
d = 20
X = np.zeros((N, d))
for p in range(d):
    X[:,p] = x**p

lambd = 0.1
A = np.vstack((X, np.sqrt(lambd)*np.eye(d)))
b = np.hstack((y, np.zeros(d)))
theta = np.linalg.lstsq(A, b, rcond=None)[0]

# Interpolate and display results
t = np.linspace(-1, 1, 500)
Xhat = np.zeros((500,d))
for p in range(d):
    Xhat[:,p] = t**p
yhat = np.dot(Xhat, theta)

plt.plot(x,y,'o',markersize=12)
plt.plot(t,yhat, linewidth=4)
plt.show()


Why does ridge regression work? 


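• The penalty term $\|\boldsymbol{\theta}\|^2$ in

$$\widehat{\boldsymbol{\theta}}_{\text{ridge}} = \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\operatorname{argmin}} \; \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2 + \lambda\|\boldsymbol{\theta}\|^2$$

does not allow solutions with very large $\|\boldsymbol{\theta}\|^2$.

• The penalty term adds a positive offset to the eigenvalues of $\mathbf{X}^T\mathbf{X}$.


• Since the denominator in $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ becomes larger than that of
$(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, noise in $\mathbf{y}$ is less amplified.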


Choosing the parameter 


How should we choose the parameter $\lambda$? The honest answer is that there is no answer
because the optimal $\lambda$ can only be found if we have access to the testing samples. If we do,
we can plot the MSE (the testing error) with respect to $\lambda$, as shown in Figure 7.22(a).

[Figure 7.22, two panels: (a) Testing error vs. $\lambda$; (b) $F(\theta_\lambda)$ vs. $R(\theta_\lambda)$]

Figure 7.22: (a) Determining the optimal $\lambda$ requires knowledge of the testing samples. In practice, we
can replace the testing samples with the validation samples, which are subsets of the training data. Then
by plotting the validation error as a function of $\lambda$ we can determine the optimal $\lambda$. (b) The alternative
is to plot $F(\theta_\lambda)$ versus $R(\theta_\lambda)$. The optimal $\lambda$ can be found by locating the elbow point.

Of course, in reality we do not have access to the testing data. However, we can reserve
a small portion of the training samples and treat them as validation samples. Then we
run the ridge regression for different choices of $\lambda$. The $\lambda$ that minimizes the error on these
validation samples is the one that you should deploy. If the training set is small, we can
shuffle the validation samples randomly and compute the average. This scheme is known as
cross-validation.
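The validation procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names `ridge_fit` and `cross_validate_lambda`, the number of random splits, and the size of the validation set are all hypothetical choices, and the data reuse the polynomial setup of the ridge regression example.

```python
import numpy as np

def ridge_fit(X, y, lambd):
    """Closed-form ridge estimate (X^T X + lambda I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lambd * np.eye(d), X.T @ y)

def cross_validate_lambda(X, y, lambd_set, n_splits=5, n_val=5, seed=0):
    """Average the validation error over random train/validation splits."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    errors = np.zeros(len(lambd_set))
    for _ in range(n_splits):
        perm = rng.permutation(N)
        val, train = perm[:n_val], perm[n_val:]   # reserve n_val samples
        for i, lambd in enumerate(lambd_set):
            theta = ridge_fit(X[train], y[train], lambd)
            errors[i] += np.mean((X[val] @ theta - y[val]) ** 2)
    errors /= n_splits
    return lambd_set[np.argmin(errors)], errors

# Illustrative data: noisy samples of a 4th-order polynomial
rng = np.random.default_rng(1)
N, d = 20, 20
x = np.linspace(-1, 1, N)
a = np.array([0.5, -2, -3, 4, 6])
y = np.polyval(a[::-1], x) + 0.25 * rng.standard_normal(N)
X = np.vander(x, d, increasing=True)   # columns x**0, ..., x**(d-1)

lambd_set = np.logspace(-4, 2, 25)
best_lambd, val_errors = cross_validate_lambda(X, y, lambd_set)
print(best_lambd)
```

Averaging over several random splits reduces the dependence of the selected $\lambda$ on which samples happen to be held out, which matters most when the training set is small.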

For some problems, there are “tactics” you may be able to employ for determining
the optimal $\lambda$. The first approach is to ask yourself what would be a reasonable range
of $\|\theta\|^2$ or $\|X\theta - y\|^2$. Are you expecting them to be large or small? Approximately
what order of magnitude? If you have some clues about this, then you can plot the function
$F(\theta_\lambda) \overset{\text{def}}{=} \|X\theta_\lambda - y\|^2$ as a function of $R(\theta_\lambda) = \|\theta_\lambda\|^2$, where $\theta_\lambda$ is a shorthand notation
for $\hat{\theta}_{\text{ridge}}(\lambda)$, which is the estimated parameter using a specific value of $\lambda$. Figure 7.22(b)
shows an example of such a plot. As you can see, by varying $\lambda$ we have different values of
$F(\theta_\lambda)$ and $R(\theta_\lambda)$.

If you have some ideas about what $\|\theta\|^2$ should be, say you want $\|\theta\|^2 \le \tau$, you can go
to the $F(\theta_\lambda)$ versus $R(\theta_\lambda)$ curve and find a point such that $R(\theta_\lambda) \le \tau$. On the other hand,
if you want $\|X\theta - y\|^2 \le \epsilon$, you can also go to the $F(\theta_\lambda)$ versus $R(\theta_\lambda)$ curve and find a
point such that $\|X\theta - y\|^2 \le \epsilon$. In either case, you have the freedom to shift the difficulty
of finding $\lambda$ to that of finding $\tau$ or $\epsilon$. Note that $\tau$ and $\epsilon$ have better physical interpretations.
The quantity $\epsilon$ tells us the upper bound of the prediction error, and $\tau$ tells us the upper
bound of the parameter magnitude. If you have been working on your dataset long enough,
the historical data (and your experience) will help you determine these values.

Another feasible option suggested in the literature is finding the anchor point of the
$F(\theta_\lambda)$ versus $R(\theta_\lambda)$ curve. The idea is that if the curve has a sharp elbow, the turning point would
indicate a rapid increase/decrease in $F(\theta_\lambda)$ (or $R(\theta_\lambda)$).


How to determine $\lambda$

• Cross-validation: Reserve a few training samples as validation samples. Check the prediction error w.r.t. these validation samples. The $\lambda$ that minimizes the validation error is the one you deploy.

• $\|\theta\|^2 \le \tau$: Plot $F(\theta_\lambda)$ against $R(\theta_\lambda)$. Then go along the $R$-axis to find the position where $R(\theta_\lambda) \le \tau$.

• $\|X\theta - y\|^2 \le \epsilon$: Plot $F(\theta_\lambda)$ against $R(\theta_\lambda)$. Then go along the $F$-axis to find the position where $F(\theta_\lambda) \le \epsilon$.

• Find the elbow point of the $F(\theta_\lambda)$ versus $R(\theta_\lambda)$ curve.
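The curve of $F(\theta_\lambda)$ against $R(\theta_\lambda)$ is inexpensive to trace for ridge regression because each point has a closed form. The sketch below assumes a random design matrix and two hypothetical helpers, `largest_lambda_with_fidelity` and `smallest_lambda_with_norm`, which read a value of $\lambda$ off the grid given a budget $\epsilon$ or $\tau$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 10
X = rng.standard_normal((N, d))                  # assumed design matrix
y = X @ rng.standard_normal(d) + 0.25 * rng.standard_normal(N)

lambd_set = np.logspace(-3, 3, 30)
F = np.zeros(len(lambd_set))    # data fidelity ||X theta - y||^2
R = np.zeros(len(lambd_set))    # regularization ||theta||^2
for i, lambd in enumerate(lambd_set):
    theta = np.linalg.solve(X.T @ X + lambd * np.eye(d), X.T @ y)
    F[i] = np.sum((X @ theta - y) ** 2)
    R[i] = np.sum(theta ** 2)

def largest_lambda_with_fidelity(eps):
    """Largest lambda on the grid with F <= eps, or None if none qualifies."""
    ok = np.where(F <= eps)[0]
    return lambd_set[ok[-1]] if ok.size > 0 else None

def smallest_lambda_with_norm(tau):
    """Smallest lambda on the grid with R <= tau, or None if none qualifies."""
    ok = np.where(R <= tau)[0]
    return lambd_set[ok[0]] if ok.size > 0 else None
```

As $\lambda$ grows, $F(\theta_\lambda)$ does not decrease and $R(\theta_\lambda)$ does not increase, so each budget translates into a one-sided range of admissible $\lambda$; plotting `R` against `F` (for example with `plt.plot(R, F)`) gives a curve of the kind shown in Figure 7.22(b).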


Bias and variance trade-off for ridge regression 


We now discuss the bias and variance trade-off of the ridge regression. 


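Theorem 7.7. Let $y = X\theta + e$ be the training data, where $e$ is zero-mean and has a
covariance $\sigma^2 I$. Consider the ridge regression

$$\hat{\theta}_\lambda = \underset{\theta \in \mathbb{R}^d}{\text{argmin}} \;\; \|X\theta - y\|^2 + \lambda\|\theta\|^2. \qquad (7.37)$$

Then the estimate has the properties that

$$\begin{aligned}
\hat{\theta}_\lambda &= (X^TX + \lambda I)^{-1}X^T(X\theta + e), \\
\mathbb{E}[\hat{\theta}_\lambda] &= (X^TX + \lambda I)^{-1}X^TX\theta = W_\lambda\theta, \\
\text{Cov}[\hat{\theta}_\lambda] &= \sigma^2 (X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}, \\
\text{MSE}(\hat{\theta}_\lambda, \theta) &= \sigma^2\,\text{Tr}\left\{W_\lambda(X^TX)^{-1}W_\lambda^T\right\} + \theta^T(W_\lambda - I)^T(W_\lambda - I)\theta,
\end{aligned}$$

where $W_\lambda \overset{\text{def}}{=} (X^TX + \lambda I)^{-1}X^TX$.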


Proof. The proof of this theorem involves some tedious matrix operations that will be 
omitted here. If you are interested in the proof you can consult van Wieringen’s “Lecture 
notes on ridge regression”, https://arxiv.org/pdf/1509.09169.pdf. 


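The results of this theorem provide a way to assess the bias and variance. Specifically,
from the MSE we know that

$$\begin{aligned}
\text{MSE}(\hat{\theta}_\lambda, \theta) &= \mathbb{E}_e\left[\|\hat{\theta}_\lambda - \theta\|^2\right] \\
&= \|\mathbb{E}_e[\hat{\theta}_\lambda] - \theta\|^2 + \text{Tr}\left\{\text{Cov}[\hat{\theta}_\lambda]\right\} \\
&= \underbrace{\theta^T(W_\lambda - I)^T(W_\lambda - I)\theta}_{\text{bias}} + \underbrace{\sigma^2\,\text{Tr}\left\{W_\lambda(X^TX)^{-1}W_\lambda^T\right\}}_{\text{variance}}.
\end{aligned}$$

The bias and variance are defined respectively as

$$\begin{aligned}
\text{Bias}(\hat{\theta}_\lambda, \theta) &= \theta^T(W_\lambda - I)^T(W_\lambda - I)\theta, \\
\text{Var}(\hat{\theta}_\lambda, \theta) &= \sigma^2\,\text{Tr}\left\{W_\lambda(X^TX)^{-1}W_\lambda^T\right\}.
\end{aligned}$$

We can then plot the bias and variance as a function of $\lambda$. An example is shown in Figure 7.23.

Figure 7.23: The bias and variance of the ridge regression behave in opposite ways as $\lambda$ increases. The
MSE is the sum of bias and variance.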
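The two expressions for the bias and the variance can be evaluated directly. The following is a minimal sketch, assuming a random full-rank design matrix, a made-up ground truth $\theta$, and a hypothetical helper `ridge_bias_variance`; it requires $X^TX$ to be invertible, as the formulas do.

```python
import numpy as np

def ridge_bias_variance(X, theta, sigma, lambd):
    """Bias and variance terms of Theorem 7.7 for one value of lambda."""
    d = X.shape[1]
    G = X.T @ X
    W = np.linalg.solve(G + lambd * np.eye(d), G)     # W_lambda
    bias = theta @ (W - np.eye(d)).T @ (W - np.eye(d)) @ theta
    var = sigma**2 * np.trace(W @ np.linalg.inv(G) @ W.T)
    return bias, var

rng = np.random.default_rng(0)
N, d, sigma = 30, 5, 0.5
X = rng.standard_normal((N, d))          # assumed design matrix
theta = rng.standard_normal(d)           # assumed ground truth

lambd_set = np.logspace(-2, 3, 40)
bias, var = np.array([ridge_bias_variance(X, theta, sigma, l)
                      for l in lambd_set]).T
mse = bias + var
print(lambd_set[np.argmin(mse)])         # lambda with the smallest MSE on the grid
```

Plotting `bias`, `var`, and `mse` against `lambd_set` on a logarithmic axis gives curves of the kind summarized in Figure 7.23.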


The result in Figure 7.23 can be summarized in three points: 


• Bias ↑ as $\lambda$ ↑. This is because a large $\lambda$ pushes the solution towards $\theta = 0$. Therefore,
the bias with respect to the ground truth $\theta$ will increase.


• Variance ↓ as $\lambda$ ↑. Since variance is caused by noise, increasing $\lambda$ forces the solution
$\theta$ to be small. Hence, it becomes less sensitive to noise.


• MSE reaches a minimum point somewhere in the middle. The MSE is the sum of bias
and variance. Therefore, it drops to the minimum and then rises again as $\lambda$ increases.


With an appropriate choice of $\lambda$, we can show that the ridge regression can have a
lower mean squared error than the vanilla regression. The following result is due to C. M.
Theobald:⁶


Theorem 7.8. For $0 < \lambda < 2\sigma^2\|\theta\|^{-2}$,

$$\text{MSE}\left(\hat{\theta}_{\text{ridge}}(\lambda), \theta\right) < \text{MSE}\left(\hat{\theta}_{\text{LS}}, \theta\right).$$


This theorem says that as long as $\lambda$ is small enough, the ridge regression will have a lower
MSE than the vanilla regression. Thus ridge regression is almost always helpful. Of course,
the optimal $\lambda$ is not provided by the theorem, which only tells us where to search for a
good $\lambda$.
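The bound in Theorem 7.8 can be checked against the MSE formula of Theorem 7.7. This is a minimal numerical sketch, assuming a random full-rank design matrix and a made-up ground truth; the helper `ridge_mse` is hypothetical, and setting $\lambda = 0$ in it gives the MSE of the vanilla regression.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 30, 5, 0.5
X = rng.standard_normal((N, d))          # assumed design matrix
theta = rng.standard_normal(d)           # assumed ground truth
G = X.T @ X

def ridge_mse(lambd):
    """MSE of the ridge estimate from the formula in Theorem 7.7."""
    W = np.linalg.solve(G + lambd * np.eye(d), G)
    bias = np.sum(((W - np.eye(d)) @ theta) ** 2)
    var = sigma**2 * np.trace(W @ np.linalg.inv(G) @ W.T)
    return bias + var

mse_ls = ridge_mse(0.0)                           # vanilla regression
lambd_max = 2 * sigma**2 / np.sum(theta**2)       # bound in Theorem 7.8
for lambd in np.linspace(0.1, 0.99, 5) * lambd_max:
    print(f"lambda={lambd:.4f}  ridge MSE={ridge_mse(lambd):.5f}  LS MSE={mse_ls:.5f}")
```

The bound is only sufficient: values of $\lambda$ above $2\sigma^2\|\theta\|^{-2}$ may still improve on the vanilla regression, which is why a search over $\lambda$ is needed in practice.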


Why does ridge regression reduce the testing error?


• The regularization reduces the variance (see Figure 7.23 when $\lambda > 0$).


• It pays the price of increasing the bias.


• Usually, the drop in variance outweighs the increase in bias. So the overall MSE
drops.


• Bias is not always a bad thing.


⁶ Theobald, C. M. (1974). Generalizations of mean square error applied to ridge regression. Journal of
the Royal Statistical Society. Series B (Methodological), 36(1), 103-106.


7.4.2  LASSO regularization 


The ridge regression we discussed in the previous subsection is just one of the many possible
ways of doing regularization. One alternative is to replace $\|\theta\|^2$ by $\|\theta\|_1$, where

$$\|\theta\|_1 = \sum_{p=0}^{d-1} |\theta_p|. \qquad (7.39)$$

This change from the sum-squares to sum-absolute-values has been a main driving force in
data science, machine learning, and signal processing for at least the past two decades. The
optimization associated with $\|\theta\|_1$ is

$$\underset{\theta \in \mathbb{R}^d}{\text{minimize}} \;\; \|X\theta - y\|^2 + \lambda\|\theta\|_1, \qquad (7.40)$$

or

$$\mathcal{E}_{\text{train}}(\theta) = \underbrace{\sum_{n=1}^{N}\left(y_n - \sum_{p=0}^{d-1}\theta_p\phi_p(x_n)\right)^2}_{F(\theta),\ \text{data fidelity}} + \underbrace{\lambda\sum_{p=0}^{d-1}|\theta_p|}_{\lambda \cdot R(\theta),\ \text{regularization}}. \qquad (7.41)$$


Seeking a sparse solution

To understand the choice of $\|\cdot\|_1$, we need to introduce the concept of sparsity.


Definition 7.1. A vector $\theta$ is called sparse if it has only a few non-zero elements.


As illustrated in Figure 7.24, a sparse $\theta$ ensures that only a very few columns of the data
matrix $X$ are active. This is an attractive property because, in some of the regression
problems, it is indeed possible to have just a few dominant factors. The LASSO regression
says that if our problem possesses this sparse solution, then the $\|\cdot\|_1$ can help us find the
sparse solution.


[Figure 7.24: the observation $y$, the data matrix $X$, and a sparse vector $\theta$]


Figure 7.24: A vector $\theta$ is sparse if it only contains a few non-zero elements. If $\theta$ is sparse, then the
observation $y$ is determined by a few active components.

How can $\|\theta\|_1$ promote sparsity? If we consider the sets

$$\begin{aligned}
\Omega_1 &= \{\theta \mid \|\theta\|_1 \le \tau\} = \{(\theta_1, \theta_2) \mid |\theta_1| + |\theta_2| \le \tau\}, \\
\Omega_2 &= \{\theta \mid \|\theta\|^2 \le \tau\} = \{(\theta_1, \theta_2) \mid \theta_1^2 + \theta_2^2 \le \tau\},
\end{aligned}$$

we note that $\Omega_1$ has a diamond shape whereas $\Omega_2$ has a circular shape. Since the level sets of the data
fidelity term $\|X\theta - y\|^2$ are ellipsoids, seeking the optimal value in the presence of the
regularization term can be viewed as moving the ellipsoid until it touches the set defined
by the regularization. As illustrated in Figure 7.25, since $\{\theta \mid \|\theta\|^2 \le \tau\}$ is a circle, the
solution will be somewhere in the middle. On the other hand, since $\{\theta \mid \|\theta\|_1 \le \tau\}$ is a
diamond, the solution will be one of the vertices. The difference between “somewhere in
the middle” and “a vertex” is that the vertex is a sparse solution, since by the definition of
a vertex one coordinate must be zero and the other coordinate must be non-zero. We can
easily extrapolate this idea to the higher-dimensional spaces. In this case, we will see that
the solution for the $\|\cdot\|_1$ problem has only a few non-zero entries.


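[Figure 7.25: contours of $F(\theta)$ around $\hat{\theta}_{\text{vanilla}}$. The LASSO solution $\hat{\theta}_{\text{LASSO}}$ attains the smallest $F(\theta)$ while still feasible, and it lies at a vertex of the feasible set, where all solutions need to live.]


Figure 7.25: A vector $\theta$ is sparse if it contains only a few non-zero elements. If $\theta$ is sparse, then the
observation $y$ is determined by a few active components.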


The optimization formulated in Equation (7.41) is known as the least absolute shrink- 
age and selection operator (LASSO). LASSO problems are difficult, but over the past two 
decades we have increased our understanding of the problem. The most significant break- 
through is that we now have algorithms to solve the LASSO problem efficiently. This is 
important because, unlike the ridge regression, where we have a (very simple) closed-form 
solution, the LASSO problem can only be solved using iterative algorithms. 
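To make “iterative algorithm” concrete, here is a minimal sketch of one such method, the proximal gradient iteration (often called ISTA). This is a swapped-in illustration, not the solver used by the packages discussed in this section; the function names, the step size choice, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise shrinkage: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(X, y, theta, lambd):
    """LASSO training loss ||X theta - y||^2 + lambd * ||theta||_1."""
    return np.sum((X @ theta - y) ** 2) + lambd * np.sum(np.abs(theta))

def lasso_ista(X, y, lambd, n_iter=2000):
    """Proximal gradient (ISTA) for min ||X theta - y||^2 + lambd * ||theta||_1."""
    theta = np.zeros(X.shape[1])
    # Step size 1/L, where L = 2 * sigma_max(X)^2 bounds the gradient's Lipschitz constant
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ theta - y)        # gradient of the data fidelity term
        theta = soft_threshold(theta - step * grad, step * lambd)
    return theta

# Illustrative sparse problem (assumed): only 2 of 10 coefficients are non-zero
rng = np.random.default_rng(0)
N, d = 50, 10
X = rng.standard_normal((N, d))
theta_true = np.zeros(d)
theta_true[[1, 4]] = [3.0, -2.0]
y = X @ theta_true + 0.1 * rng.standard_normal(N)

theta_hat = lasso_ista(X, y, lambd=5.0)
print(np.round(theta_hat, 2))
```

Each iteration takes a gradient step on the data fidelity term and then shrinks every coefficient towards zero; the shrinkage step is what sets small coefficients exactly to zero, which the ridge penalty never does.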


What is so special about LASSO?

• LASSO regularization promotes a sparse solution.


• If the underlying model has a sparse solution, e.g., you choose a 50th-order
polynomial, but the underlying model is a third-order polynomial, then there
should be at most four non-zero regression coefficients in your 50th-order polyno-
mial. LASSO will help in this case.

• If the underlying model has a dense solution, then LASSO is of limited value. A
ridge regression could be better.


• While $\|\theta\|_1$ is not differentiable (at 0), there exist polynomial-time convex algo-
rithms to solve the problem, e.g., interior-point methods.


Solving the LASSO problem 


Today, there are many open-source packages to solve the LASSO problem. They are mostly
developed in the convex optimization literature. One of the most user-friendly packages is
the CVX package developed by S. Boyd and colleagues at Stanford University.⁷ Once you
have downloaded and installed the package, solving the optimization can be done literally
by typing in the data fidelity term and the regularization term. An example is given below.


cvx_begin 
variable theta(d) 


minimize (sum_square(X*theta-y) + lambda*norm(theta,1)) 
cvx_end 


As you can see, the program is extremely simple. You start by calling cvx_begin
and end it with cvx_end. Inside the box we create a variable theta(d), where d denotes
the dimension of the vector theta. The main command is minimize, and this line is
almost self-explanatory. As long as you follow the syntax given by the user guidelines, you
will be able to set it up properly.

In Python, we can call the cvxpy library. 


import cvxpy as cvx
theta = cvx.Variable(d)
# X @ theta is the matrix-vector product
objective = cvx.Minimize( cvx.sum_squares(X @ theta - y) \
                          + lambd*cvx.norm1(theta) )
prob = cvx.Problem(objective)
prob.solve()


To see a concrete example, we use the crime rate data obtained from
https://web.stanford.edu/~hastie/StatLearnSparsity/data.html. A snapshot of the data is shown
in the table below. In this dataset, the vector $y$ is the crime rate, which is the second column
of the table. The feature/basis vectors are funding, hs, not-hs, college.


city | crime rate | funding | hs | no-hs | college
1    | 478        | 40      | 74 | 11    | 31
2    | 494        | 32      | 72 | 11    | 43
3    | 643        | 57      | 71 | 18    | 16
4    | 341        | 31      | 71 | 11    | 25
...  | ...        | ...     | ...| ...   | ...
50   | 940        | 66      | 67 | 26    | 18


⁷ The MATLAB version is here: http://cvxr.com/cvx/. The Python version is here: https://cvxopt.org/.
Follow the instructions to install the package.




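We consider two optimizations:

$$\begin{aligned}
\hat{\theta}_1(\lambda) &= \underset{\theta}{\text{argmin}} \;\; \mathcal{E}_1(\theta) \overset{\text{def}}{=} \|X\theta - y\|^2 + \lambda\|\theta\|_1, \\
\hat{\theta}_2(\lambda) &= \underset{\theta}{\text{argmin}} \;\; \mathcal{E}_2(\theta) \overset{\text{def}}{=} \|X\theta - y\|^2 + \lambda\|\theta\|^2.
\end{aligned}$$

As we have discussed, the first optimization uses the $\|\cdot\|_1$ regularized least squares, which is
the LASSO problem. The second optimization is the standard $\|\cdot\|^2$ regularized least squares.
Since both solutions depend on the parameter $\lambda$, we parameterize the solutions in terms of
$\lambda$. Note that the optimal $\lambda$ for $\hat{\theta}_1$ is not necessarily the optimal $\lambda$ for $\hat{\theta}_2$.

One thing we would like to demonstrate in this example is visualizing the linear regression
coefficients $\hat{\theta}_1(\lambda)$ and $\hat{\theta}_2(\lambda)$ as $\lambda$ changes. To solve the optimization, we use CVX.
The MATLAB and Python implementations are shown below.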


data = load('./dataset/data_crime.txt');
y = data(:,1);       % The observed crime rate
X = data(:,3:end);   % Feature vectors
[N,d] = size(X);

lambdaset = logspace(-1,8,50);
theta_store = zeros(d,50);
for i=1:length(lambdaset)
    lambda = lambdaset(i);
    cvx_begin
        variable theta(d)
        minimize( sum_square(X*theta-y) + lambda*norm(theta,1) )
        % minimize( sum_square(X*theta-y) + lambda*sum_square(theta) )
    cvx_end
    theta_store(:,i) = theta(:);
end

figure(1);
semilogx(lambdaset, theta_store, 'LineWidth', 4);
legend('funding','% high', '% no high', '% college', ...
       '% graduate', 'Location','NW');
xlabel('lambda');
ylabel('feature attribute');


import cvxpy as cvx
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt("/content/data_crime.txt")
y = data[:,0]        # The observed crime rate
X = data[:,2:7]      # Feature vectors
N,d = X.shape

lambd_set = np.logspace(-1,8,50)
theta_store = np.zeros((d,50))
for i in range(50):
    lambd = lambd_set[i]
    theta = cvx.Variable(d)
    objective = cvx.Minimize( cvx.sum_squares(X @ theta - y) \
                              + lambd*cvx.norm1(theta) )
    # objective = cvx.Minimize( cvx.sum_squares(X @ theta - y) \
    #                           + lambd*cvx.sum_squares(theta) )
    prob = cvx.Problem(objective)
    prob.solve()
    theta_store[:,i] = theta.value

for i in range(d):
    plt.semilogx(lambd_set, theta_store[i,:])
plt.show()


[Figure 7.26, two panels: (a) LASSO; (b) Ridge. Each panel plots the feature attribute against lambda on a logarithmic axis, with one curve per feature: funding, % high, % no high, % college, % graduate.]


Figure 7.26: Ridge and LASSO regression on the crime-rate dataset. (a) The LASSO regression suggests
that there are only a few active components as we change $\lambda$. (b) The ridge regression returns a set of
dense solutions for all choices of $\lambda$.


Figure 7.26 shows some interesting differences between the two regression models. 


• Trajectory. For the $\|\cdot\|^2$ estimate $\hat{\theta}_2(\lambda)$, the trajectory of the regression coefficients
is smooth. This is attributable to the fact that the training loss $\mathcal{E}_2(\theta)$ is continuously
differentiable in $\theta$, and so the solution trajectory is smooth. By contrast, the $\|\cdot\|_1$
estimate $\hat{\theta}_1(\lambda)$ has a more disruptive trajectory.


• Active members. For the LASSO problem, $\hat{\theta}_1(\lambda)$ switches the active member as $\lambda$
changes. For example, the feature high-school is the first one being activated as
$\lambda$ ↓. This implies that if we limit ourselves to only one feature, then high-school is
the feature we should select. The ridge regression does not have this feature-selection
property. How about when $\lambda = 10^6$? In this case, the LASSO has two active members:
funding and high-school. This suggests that if there are two contributing factors,
funding and high-school are the two. At $\lambda = 10^4$, we see that in LASSO, the green
curve goes to zero but then the red curve rises. This means a correlation between
high school and no high school, which should not be a surprise because they are
complementary to each other.


• Magnitude of solutions. The magnitude of the solutions does not necessarily convey
a clear conclusion because the feature vectors (e.g., high school) and the observable
crime rate have different units.


• Limiting solutions. As $\lambda \to 0$, both $\hat{\theta}_1(\lambda)$ and $\hat{\theta}_2(\lambda)$ reach the same solution, because
the training losses are identical when $\lambda = 0$.


LASSO for overfitting 


Does LASSO help to mitigate the overfitting problem? Not always, but it often does. In
Figure 7.27 we consider fitting a dataset of $N = 20$ data points. The ground truth model
we use is

$$y_n = L_0(x_n) + 0.5L_1(x_n) + 0.5L_2(x_n) + 1.5L_3(x_n) + L_4(x_n) + e_n,$$

where $L_p(\cdot)$ is the Legendre polynomial of the $p$th order and $e_n \sim \text{Gaussian}(0, \sigma^2)$ for $\sigma = 0.25$.
When fitting the data, we purposely choose a
20th-order Legendre polynomial as the regression model. With only $N = 20$ data points, we
can be almost certain that there is overfitting.

The MATLAB and Python codes for solving this LASSO problem are shown below. 


% MATLAB code to demonstrate overfitting and LASSO 
% Generate data 


N = 20; 

x = linspace(-1,1,N)';

a = [1, 0.5, 0.5, 1.5, 1];

y = a(1)*legendreP(0,x)+a(2)*legendreP(1,x)+a(3)*legendreP(2,x)+ ...
    a(4)*legendreP(3,x)+a(5)*legendreP(4,x)+0.25*randn(N,1);


% Solve LASSO using CVX 
d = 20; 
X = zeros(N, d); 
for p=0:d-1 
    X(:,p+1) = reshape(legendreP(p,x),N,1);
end 
lambda = 2; 
cvx_begin 
variable theta(d) 
minimize( sum_square( X*theta - y ) + lambda * norm(theta , 1) ) 
cvx_end 


% Plot results 
t = linspace(-1, 1, 200); 
Xhat = zeros(length(t), d); 
for p=0:d-1 
Xhat(:,p+1) = reshape(legendreP(p,t) ,200,1); 
end 
yhat = Xhat*theta;
plot(x,y,'ko','LineWidth',2,'MarkerSize',10); hold on;
plot(t,yhat,'LineWidth',6,'Color',[0.2 0.5 0.2]);


# Python code to demonstrate overfitting and LASSO
import cvxpy as cvx
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import eval_legendre

# Setup the problem
N = 20
x = np.linspace(-1,1,N)
a = np.array([1, 0.5, 0.5, 1.5, 1])
y = a[0]*eval_legendre(0,x) + a[1]*eval_legendre(1,x) + \
    a[2]*eval_legendre(2,x) + a[3]*eval_legendre(3,x) + \
    a[4]*eval_legendre(4,x) + 0.25*np.random.randn(N)

# Solve LASSO using CVX
d = 20
lambd = 1
X = np.zeros((N, d))
for p in range(d):
    X[:,p] = eval_legendre(p,x)
theta = cvx.Variable(d)
objective = cvx.Minimize( cvx.sum_squares(X @ theta - y) \
                          + lambd*cvx.norm1(theta) )
prob = cvx.Problem(objective)
prob.solve()
thetahat = theta.value

# Plot the curves
t = np.linspace(-1, 1, 500)
Xhat = np.zeros((500,d))
for p in range(d):
    Xhat[:,p] = eval_legendre(p,t)
yhat = np.dot(Xhat, thetahat)
plt.plot(x, y, 'o')
plt.plot(t, yhat, linewidth=4)
plt.show()


Let us compare the various regression results. Figure 7.27(b) shows the vanilla regression,
which as you can see fits the $N = 20$ data points very well. However, no one would
believe that such a fitting curve can generalize to unseen data. Figure 7.27(c) shows the
ridge regression result. When performing the analysis, we sweep a range of $\lambda$ and pick the
value $\lambda = 0.5$ so that the fitted curve is neither too “wild” nor too “flat”. We can see that
the fitting is improved. However, since the ridge regression only penalizes large-magnitude
coefficients, the fitting is still not ideal. Figure 7.27(d) shows the LASSO regression result.
Since the true model is a 4th-order polynomial and we use a 20th-order polynomial, the true
solution is sparse. Therefore, LASSO is helpful, and hence we can pick a sparse solution.

Figure 7.27: We fit a dataset of $N = 20$ data points. (a) The ground truth model that generates
the data. The model is a 4th-order ordinary polynomial. (b) Vanilla regression result, without any
regularization. Note that there is severe overfitting because the model complexity is too high. (c) Ridge
regression result, by setting $\lambda = 0.5$. (d) LASSO regression result, by setting $\lambda = 2$.

The significance of LASSO is often not about the fitting of the data points but the
number of active coefficients. In Figure 7.28 we show a comparison between the ground
truth coefficients, the vanilla regression coefficients, the ridge regression coefficients, and the
LASSO regression coefficients. It is evident that the LASSO solution contains a much smaller
number of non-zeros compared to the ridge regression. Most of the high-order coefficients
are zero. By contrast, the vanilla regression coefficients are wild. The ridge regression is
better, but there are many non-zero high-order coefficients.


Closing remark. In this section, we discussed two regularization techniques: ridge regres- 
sion and LASSO regression. Both techniques are about adding a penalty term to the training 
loss to constrain the regression coefficients. In the optimization literature, writings on ridge 
and LASSO regression are abundant, covering both algorithms and theoretical properties. 
An example of a theoretical question addressed in the literature is: Under what conditions 
is LASSO guaranteed to recover the correct support of the solution, i.e., locating the correct 
positions of the non-zeros? Problems like these are beyond the scope of this book.

Figure 7.28: Coefficients of the regression models. (a) The ground truth model, which is a 4th-order
polynomial. There are only 5 non-zero coefficients. (b) The vanilla regression coefficients. Note that 
the values are wild and large, although the curve fits the training data points very well. (c) The ridge 
regression coefficients. While the overall magnitudes are significantly improved from the vanilla, some 
high-order coefficients are still non-zero. (d) The LASSO regression coefficients. There are very few 
non-zeros, and the non-zeros match well with the ground truth. 


7.5 Summary 


Regression is one of the most widely used techniques in data science. The formulation of the 
regression problem is as simple as setting up a system of linear equations: 


$$\underset{\theta}{\text{minimize}} \;\; \|X\theta - y\|^2, \qquad (7.42)$$


which has a closed-form solution. The biggest problems in practice are outliers, lack of 
training samples, and poor choice of the regression model. 


• Outliers: We always recommend plotting the data whenever possible to check if there
are obvious outliers. There are also statistical tests in which you can evaluate the
validity of your samples. One simple way to debug outliers is to run the regression
and check the prediction error against each training sample. If you have an outlier,
and if your model is of reasonably low complexity, then a sample with an excessively
large prediction error is an outlier. For example, if most of the training samples are
within one standard deviation from your prediction but a few are substantially off,
you will know which ones are the outliers. Robust linear regression is one technique for
countering outliers, but an experienced data scientist can often reject outliers before
running any regression algorithms. Domain knowledge is of great value for this purpose.


• Lack of training samples: As we have discussed in the overfitting section, it is extremely
important to ensure that your model complexity is appropriate for the number
of training samples. If the training set is small, do not use a complex model. Regularization
techniques are valuable tools to mitigate overfitting. However, choosing a
good regularization requires domain knowledge. For example, if you know that some
features are not important, you need to scale them properly so as not to over-influence
the regression solution.


• Wrong model: We have mentioned several times that regression can always return you
a result because regression is an optimization problem. However, whether that result
is meaningful depends on how meaningful your regression problem is. For example, if
the noise is i.i.d. Gaussian, a data fidelity term with $\|\cdot\|^2$ would be a good choice;
however, if the noise is i.i.d. Poisson, $\|\cdot\|^2$ would become a very bad model. We need
a tighter connection with the statistics of the underlying data-generation model for
problems like these. This is the subject of our next chapter, on parameter estimation.
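The residual check suggested for outliers can be sketched as follows. This is a minimal illustration, assuming a noise-free straight line with one corrupted sample and a hypothetical helper `flag_outliers`; the threshold of three standard deviations is an arbitrary choice for the example, not a recommendation from the text.

```python
import numpy as np

def flag_outliers(X, y, k=3.0):
    """Fit least squares, then flag samples whose residual exceeds k standard deviations."""
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    residual = y - X @ theta
    return np.where(np.abs(residual) > k * np.std(residual))[0]

# Illustrative data (assumed): a straight line with one corrupted sample
N = 50
x = np.linspace(-1, 1, N)
y = 2 * x + 1
y[7] += 10                                  # inject a single outlier at index 7
X = np.column_stack((np.ones(N), x))        # low-complexity model: intercept + slope

print(flag_outliers(X, y))
```

The check relies on the model being of low complexity: a model with too many parameters can bend towards the corrupted sample, which shrinks its residual and hides the outlier.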


7.6 References 


Linear regression 


Treatment of standard linear regression is abundant. In the context of machine learning and 
data science, the following references are useful. 


7-1 Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction 
to Statistical Learning with Applications in R, Springer 2013, Chapter 3. 


7-2 Stephen Boyd and Lieven Vandenberghe, Convex Optimization, Cambridge University
Press, 2004. Chapter 6.


7-3 Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical 
Learning, Springer, 2001. Chapter 3. 


7-4 Christopher Bishop, Pattern Recognition and Machine Learning, Springer 2006. Chap- 
ter 3.1. 


7-5 Yaser Abu-Mostafa, Malik Magdon-Ismail and Hsuan-Tien Lin, Learning from Data, 
AML Book, 2012. Chapter 3.2 


Overfitting and Bias/Variance 


The theory of overfitting and the trade-off between bias and variance can be found in multiple
references. The following are basic treatments of the subject.

7-6 Yaser Abu-Mostafa, Malik Magdon-Ismail and Hsuan-Tien Lin, Learning from Data,
AML Book, 2012. Chapter 4.


7-7 Christopher Bishop, Pattern Recognition and Machine Learning, Springer 2006. Chap- 
ter 3.2. 


Ridge and LASSO regression 


Ridge and LASSO regression are important tools in statistical learning today. The following 
two textbooks cover some of the perspectives of the statistical community and the signal 
processing community. 


7-8 Trevor Hastie, Robert Tibshirani, and Martin Wainwright, Statistical Learning with 
Sparsity: The LASSO and Generalizations, CRC Press, 2015. 


7-9 Michael Elad, Sparse and Redundant Representations, Springer, 2010. Chapters 1 
and 3. 


7.7 Problems 


Exercise 1.
(a) Construct a dataset with $N = 20$ samples, following the model

$$y_n = \sum_{p=0}^{d-1} \theta_p L_p(x_n) + e_n, \qquad (7.43)$$

where $\theta_0 = 1$, $\theta_1 = 0.5$, $\theta_2 = 0.5$, $\theta_3 = 1.5$, $\theta_4 = 1$, for $-1 < x < 1$. Here, $L_p(x)$ is the
Legendre polynomial of the $p$th order. The $N = 20$ samples are sampled uniformly at random
from the interval $[-1, 1]$. The noise samples $e_n$ are i.i.d. Gaussian with variance
$\sigma^2 = 0.25^2$. Plot the dataset using the MATLAB or Python command scatter.


(b) Run the regression using the same model where $d = 5$, without any regularization.
Plot the predicted curve and overlay with the training samples.


(c) Repeat (b) by running the regression with $d = 20$. Explain your observations.


(d) Increase the number of training samples $N$ to $N = 50$, $N = 500$, and $N = 5000$, and
repeat (c). Explain your observations.


(e) Construct a testing dataset with $M = 1000$ testing samples. For each of the regression
models trained in (b)-(d), compute the testing error.


Exercise 2. 
Consider a data generation model

$$x_n = \sum_{k=0}^{N-1} c_k e^{-j\frac{2\pi kn}{N}}, \qquad n = 0, \ldots, N-1.$$

(a) Write the above equation in matrix-vector form

$$x = Wc.$$

What are the vectors $c$ and $x$, and what is the matrix $W$?


(b) Show that $W$ is orthogonal, i.e., $W^HW = I$, where $W^H$ is the conjugate transpose
of $W$.


(c) Using (b), derive the least squares regression solution. 


Exercise 3. 
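Consider a simplified LASSO regression problem:

$$\hat{\theta} = \underset{\theta \in \mathbb{R}^d}{\text{argmin}} \;\; \frac{1}{2}\|y - \theta\|^2 + \lambda\|\theta\|_1. \qquad (7.44)$$

Show that the solution is given by

$$\hat{\theta} = \text{sign}(y) \cdot \max\left(|y| - \lambda, 0\right), \qquad (7.45)$$

where $\cdot$ is the elementwise multiplication.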


Exercise 4. 
A one-dimensional signal is corrupted by blur and noise:

$$y_n = \sum_{\ell=0}^{L-1} h_\ell x_{n-\ell} + e_n.$$

(a) Formulate the least squares regression problem in matrix-vector form $y = Hx + e$.
Find $x$, $y$ and $H$.


(b) Consider a regularization function

$$R(x) = \sum_{n=2}^{N} (x_n - x_{n-1})^2.$$

Show that this regularization is equivalent to $R(x) = \|Dx\|^2$ for some $D$. Find $D$.

(c) Using the regularization in (b), derive the regularized least squares regression result:

$$\underset{x}{\text{minimize}} \;\; \|y - Hx\|^2 + \lambda\|Dx\|^2.$$


Exercise 5. 
Let $\sigma(\cdot)$ be the sigmoid function

$$\sigma(a) = \frac{1}{1 + e^{-a}}.$$

We want to use $\sigma(a)$ as a basis function.

(a) Show that the tanh function and the sigmoid function are related by

$$\tanh(a) = 2\sigma(2a) - 1.$$


(b) Show that a linear combination of sigmoid functions

$$y_n = \theta_0 + \sum_{p=1}^{d-1} \theta_p \, \sigma\left(\frac{x_n - \mu_p}{s}\right)$$

is equivalent to a linear combination of tanh functions

$$y_n = \alpha_0 + \sum_{p=1}^{d-1} \alpha_p \tanh\left(\frac{x_n - \mu_p}{2s}\right).$$


(c) Find the relationship between $\theta_p$ and $\alpha_p$.


Exercise 6. (NHANES Part 1)(DATA DOWNLOAD) 

The National Health and Nutrition Examination Survey (NHANES) is a program to assess
the health and nutritional status of adults and children in the United States.⁸ The complete
survey result contains over 4,000 samples of health-related data of individuals who participated
in the survey between 2011 and 2014. In the following exercises, we will focus on two
categories of the data for each individual: height (in mm) and body mass index (BMI). The
data is divided into two classes based on gender. Table 7.2 contains snippets of the data.


index | female_bmi | female_stature_mm | index | male_bmi | male_stature_mm
0     | 28.2       | 1563              | 0     | 30       | 1679
1     | 22.2       | 1716              | 1     | 25.6     | 1586
2     | 27.1       | 1484              | 2     | 24.2     | 1773
3     | 28.1       | 1651              | 3     | 27.4     | 1816


Table 7.2: Male and Female Data Snippets


Use csv.reader to read the training data files for the two data classes. 
Important! Before proceeding to the problems, 


• normalize the numbers in male_stature_mm and female_stature_mm by dividing them
by 1000, and

• normalize those of male_bmi and female_bmi by dividing them by 10.


This will significantly reduce the numerical error. 


Consider a linear model:

$$g_\theta(x) = \theta^T x. \qquad (7.46)$$

The regression problem we want to solve is

$$\hat{\theta} = \underset{\theta \in \mathbb{R}^d}{\text{argmin}} \; \sum_{n=1}^{N} \left(y_n - g_\theta(x_n)\right)^2,$$

where $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ is the training dataset. Putting the equation into the matrix form,
we know that the optimization is equivalent to

$$\hat{\theta} = \underset{\theta \in \mathbb{R}^d}{\text{argmin}} \; \underbrace{\|y - X\theta\|^2}_{\mathcal{E}_{\text{train}}(\theta)}.$$

(a) Derive the solution $\hat{\theta}$. State the conditions under which the solution is the unique
global minimum in terms of the rank of $X$. Suggest two techniques that can be used
when $X^TX$ is not invertible.


⁸ https://www.cdc.gov/nchs/nhanes/index.htm


(b) For the NHANES dataset, assign $y_n = +1$ if the $n$th sample is a male and $y_n = -1$
if the $n$th sample is a female. Implement your answer in (a) with Python to solve the
problem. Report your answer.


(c) Repeat (b), but this time use CVXPY. Report your answer, and compare with (b). 


Exercise 7. (NHANES Part 2)(DATA DOWNLOAD) 
We want to do a classification based on the linear model we found in the previous exercise.
The classifier we will use is

$$\text{predicted label} = \text{sign}(g_\theta(x)), \qquad (7.47)$$

where $x \in \mathbb{R}^d$ is a test sample. Here, we label +1 for male and −1 for female. Because
the dataset we consider in this exercise has only two columns, the linear model is

$$g_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2,$$

where $x = [1, x_1, x_2]^T$ is the input data and $\theta = [\theta_0, \theta_1, \theta_2]^T$ is the parameter vector.
(a) First, we want to visualize the classifier.
(i) Plot the training data points of the male and female classes. Mark the male class
with blue circles and the female class with red dots.


(ii) Plot the decision boundary $g_\theta(\cdot)$ and overlay it with the data plotted in (i).
Hint: $g_\theta(\cdot)$ is a straight line in 2D. You can express $x_2$ in terms of $x_1$ and other
parameters.


(b) (This problem requires knowledge of the content of Chapter 9). Report the classifica- 
tion accuracy. To do so, take testing data x and compute the prediction according to 
Equation (7.47). 


(i) What is the Type 1 error (False Alarm) of classifying males? That is, what is the 
percentage of testing samples that should be female but a male was predicted? 


(ii) What is the Type 2 error (Miss) of classifying males? That is, what is the percentage
of testing samples that should be male but a female was predicted?

(iii) What is the precision and recall for this classifier? For the definitions of precision
and recall, refer to Chapter 9.5.4.


Exercise 8. (NHANES Part 3)(DATA DOWNLOAD) 
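This exercise requires some background in optimization. Please refer to Reference [7-2, Chapters
9 and 10]. Consider the following three optimization problems:

$$\hat{\theta}_\lambda = \underset{\theta \in \mathbb{R}^d}{\text{argmin}} \;\; \|X\theta - y\|_2^2 + \lambda\|\theta\|_2^2, \qquad (7.48)$$

$$\hat{\theta}_\alpha = \underset{\theta \in \mathbb{R}^d}{\text{argmin}} \;\; \|X\theta - y\|_2^2 \quad \text{subject to} \quad \|\theta\|_2^2 \le \alpha, \qquad (7.49)$$

$$\hat{\theta}_\epsilon = \underset{\theta \in \mathbb{R}^d}{\text{argmin}} \;\; \|\theta\|_2^2 \quad \text{subject to} \quad \|X\theta - y\|_2^2 \le \epsilon. \qquad (7.50)$$


(a) Set lambd = np.arange(0.1,10,0.1). Plot

• $\|X\hat{\theta}_\lambda - y\|_2^2$ as a function of $\|\hat{\theta}_\lambda\|_2^2$.

• $\|X\hat{\theta}_\lambda - y\|_2^2$ as a function of $\lambda$.

• $\|\hat{\theta}_\lambda\|_2^2$ as a function of $\lambda$.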


(b) (i) Write down the Lagrangian for each of the three problems. Note that the first
problem does not have any Lagrange multiplier. For the second and third problems
you may use the following notations:

• $\gamma_{\alpha}$ = the Lagrange multiplier of Equation (7.49), and
• $\gamma_{\epsilon}$ = the Lagrange multiplier of Equation (7.50).

(ii) State the first-order optimality conditions (the Karush-Kuhn-Tucker or KKT
conditions) for each of the three problems. Express your answers in terms of $\mathbf{X}$,
$\boldsymbol{\theta}$, $\mathbf{y}$, $\lambda$, $\alpha$, $\epsilon$, and the two Lagrange multipliers $\gamma_{\alpha}$, $\gamma_{\epsilon}$.

(iii) Fix $\lambda > 0$. We can solve Equation (7.48) to obtain $\widehat{\boldsymbol{\theta}}_{\lambda}$. Find $\alpha$ and the Lagrange
multiplier $\gamma_{\alpha}$ in Equation (7.49) such that $\widehat{\boldsymbol{\theta}}_{\lambda}$ would satisfy the KKT conditions
of Equation (7.49).

(iv) Fix $\lambda > 0$. We can solve Equation (7.48) to obtain $\widehat{\boldsymbol{\theta}}_{\lambda}$. Find $\epsilon$ and the Lagrange
multiplier $\gamma_{\epsilon}$ in Equation (7.50) such that $\widehat{\boldsymbol{\theta}}_{\lambda}$ would satisfy the KKT conditions
of Equation (7.50).

(v) Fix $\lambda > 0$. By using the $\alpha$ and $\gamma_{\alpha}$ you found in (iii), you can show that $\widehat{\boldsymbol{\theta}}_{\lambda}$ would
satisfy the KKT conditions of Equation (7.49). Is it enough to claim that $\widehat{\boldsymbol{\theta}}_{\lambda}$ is
the solution of Equation (7.49)? If yes, why? If no, what else do we need to show?
Please elaborate through a proof, if needed.


Exercise 9. 
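Consider a training dataset $\mathcal{D}_{\text{train}} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$ and a weight $\mathbf{w} = [w_1, \ldots, w_N]^T$.
Find the regression solution to the following problem and discuss how you would choose the
weight.

$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\text{argmin}} \;\; \sum_{n=1}^{N} w_n \left(y_n - \mathbf{x}_n^T \boldsymbol{\theta}\right)^2. \tag{7.51}$$


Exercise 10.

Consider a training dataset $\mathcal{D}_{\text{train}} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$. Suppose that the input data
$\mathbf{x}_n$ is corrupted by i.i.d. Gaussian noise $\mathbf{e}_n \sim \text{Gaussian}(0, \sigma^2 \mathbf{I}_d)$ so that the training set
becomes $\mathcal{D}_{\text{train}} = \{(\mathbf{x}_1 + \mathbf{e}_1, y_1), \ldots, (\mathbf{x}_N + \mathbf{e}_N, y_N)\}$. Show that the (vanilla) least squares
linear regression by taking the expectation over $\mathbf{e}_n$,

$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \mathbb{R}^d}{\text{argmin}} \;\; \sum_{n=1}^{N} \mathbb{E}_{\mathbf{e}_n}\left[ \left(y_n - (\mathbf{x}_n + \mathbf{e}_n)^T \boldsymbol{\theta}\right)^2 \right], \tag{7.52}$$

is equivalent to a ridge regression.



Estimation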


In this chapter, we discuss another set of important combat skills in data science, namely es- 
timation. Estimation has a close relationship with regression. Regression primarily takes the 
optimization route, while estimation takes the probabilistic route. As we will see, at a cer- 
tain point the two will merge. That is, under some specific statistical conditions, estimation 
processes will coincide with the regression. 

Estimation is summarized pictorially in Figure 8.1. Imagine that we have some random
samples $X_1, \ldots, X_N$. These samples are drawn from a distribution $f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is a
parameter that characterizes the distribution. The parameter $\boldsymbol{\theta}$ is not known to us. The
goal of estimation is to solve an inverse problem to recover the parameter based on the
observations $X_1, \ldots, X_N$.


Figure 8.1: Estimation is an inverse problem of recovering the unknown parameters that were used by
the distribution. In this figure, the PDF of $\mathbf{X}$ using a parameter $\boldsymbol{\theta}$ is denoted as $f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta})$. The forward
data-generation process takes the parameter $\boldsymbol{\theta}$ and creates the random samples $X_1, \ldots, X_N$. Estimation
takes these observed random samples and recovers the underlying model parameter $\boldsymbol{\theta}$.


What is estimation?


Estimation is an inverse problem with the goal of recovering the underlying parameter
$\boldsymbol{\theta}$ of a distribution $f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta})$ based on the observed samples $X_1, \ldots, X_N$.


What are parameters?


Before we discuss the methods of estimation, let us clarify the meaning of the parameter $\boldsymbol{\theta}$.
All probability density functions (PDFs) have parameters. For example, a Bernoulli random
variable is characterized by a parameter $p$ that defines the probability of getting a “head”. A
Gaussian random variable is characterized by two parameters: the mean $\mu$ and variance $\sigma^2$.


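Example 8.1. (Parameter of a Bernoulli) If $X_n$ is a Bernoulli random variable, then
the PMF has a parameter $\theta$:

$$p_{X_n}(x_n; \theta) = \theta^{x_n}(1-\theta)^{1-x_n}.$$

Remark. The PMF is expressed in this form because $x_n$ is either 1 or 0:

$$p_{X_n}(x_n; \theta) = \begin{cases} \theta^{1}(1-\theta)^{1-1} = \theta, & \text{if } x_n = 1, \\ \theta^{0}(1-\theta)^{1-0} = 1-\theta, & \text{if } x_n = 0. \end{cases}$$


Example 8.2. (Parameter of a Gaussian) If $X_n$ is a Gaussian random variable, the
PDF is

$$f_{X_n}(x_n; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x_n - \mu)^2}{2\sigma^2} \right\},$$

where $\boldsymbol{\theta} = [\mu, \sigma^2]^T$ consists of both the mean and the variance. We can also designate
the parameter $\theta$ to be the mean only. For example, if we know that $\sigma = 1$, then the
PDF is

$$f_{X_n}(x_n; \theta) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(x_n - \theta)^2}{2} \right\},$$

where $\theta$ is the mean.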


Since all probability density functions have parameters, estimating them from the 
observed random variables is a well-defined inverse problem. Of course, there are better 
estimates and there are worse estimates. Let us look at the following example to develop 
our intuitions about estimation. 

Figure 8.2 shows a dataset containing 1000 data points generated from a 2D Gaussian
distribution with an unknown mean vector $\boldsymbol{\mu}$ and an unknown covariance matrix $\boldsymbol{\Sigma}$. We
duplicate this dataset in the four subfigures. The estimation problem is to recover the
unknown mean vector $\boldsymbol{\mu}$ and the covariance matrix $\boldsymbol{\Sigma}$. In the subfigures we propose four
candidates, each with a different mean vector and a different covariance matrix. We draw 
the contour lines of the corresponding Gaussians. It can be seen that some Gaussians fit the 
data better than others. The goal of this chapter is to develop a systematic way of finding 
the best fit for the data. 
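
One way to make "fits better" quantitative, anticipating the likelihood idea developed in Section 8.1, is to score each candidate Gaussian by the total log-density it assigns to the data. The Python sketch below is illustrative only: the data-generating parameters, the three alternative candidates, and the helper `gaussian_logpdf` are assumptions made for this example, not the values behind Figure 8.2.

```python
import numpy as np

def gaussian_logpdf(X, mu, Sigma):
    """Log-density of a multivariate Gaussian evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    # Solve Sigma z = diff^T rather than forming the inverse explicitly.
    z = np.linalg.solve(Sigma, diff.T).T
    quad = np.sum(diff * z, axis=1)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
mu_true = np.array([0.0, 0.0])                     # assumed for illustration
Sigma_true = np.array([[1.0, 0.6], [0.6, 1.0]])    # assumed for illustration
X = rng.multivariate_normal(mu_true, Sigma_true, size=1000)

# Four hypothetical candidates, each a (mean, covariance) pair.
candidates = {
    "shifted mean":      (np.array([2.0, -1.0]), np.eye(2)),
    "wrong correlation": (mu_true, np.array([[1.0, -0.6], [-0.6, 1.0]])),
    "too wide":          (mu_true, 4.0 * np.eye(2)),
    "true parameters":   (mu_true, Sigma_true),
}

scores = {name: gaussian_logpdf(X, mu, Sigma).sum()
          for name, (mu, Sigma) in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:18s} total log-density = {score:10.1f}")
print("best candidate:", best)
```

Comparing a handful of hand-picked candidates is only a stand-in for a real search; the rest of the chapter replaces it with an optimization over all admissible parameters.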


Figure 8.2: An estimation problem. Given a set of 1000 data points drawn from a Gaussian distribution
with unknown mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, we propose several candidate Gaussians and see which one
would be the best fit to the data. Visually, we observe that the right-most Gaussian has the best fit.
The goal of this chapter is to develop a systematic way of solving estimation problems of this type.


Plan for this chapter


The discussions in this chapter concern the three elementary distributions:

• Likelihood: $f_{\mathbf{X}|\boldsymbol{\Theta}}(\mathbf{x}|\boldsymbol{\theta})$, which is the conditional PDF of $\mathbf{X}$ given that the parameter
is $\boldsymbol{\Theta}$.

• Prior: $f_{\boldsymbol{\Theta}}(\boldsymbol{\theta})$, which is the PDF of $\boldsymbol{\Theta}$.

• Posterior: $f_{\boldsymbol{\Theta}|\mathbf{X}}(\boldsymbol{\theta}|\mathbf{x})$, which is the conditional PDF of $\boldsymbol{\Theta}$ given the data $\mathbf{X}$.


Each of these density functions has its respective meaning, and consequently a set of different 
estimation techniques. In Section 8.1 we introduce the concept of maximum-likelihood (ML) 
estimation. As the name suggests, the estimate is constructed by maximizing the likelihood 
function. We will discuss a few examples of ML estimation and draw connections between 
ML estimation and regression. In Section 8.2 we will discuss several basic properties of an 
ML estimate. Specifically, we will introduce the ideas of unbiasedness, consistency, and the 
invariance principle. 

The second topic discussed in this chapter is the maximum-a-posteriori (MAP) estimation,
detailed in Section 8.3. In MAP, the parameter $\boldsymbol{\Theta}$ is a random variable. Since $\boldsymbol{\Theta}$ is a
random variable, it has its own probability density function $f_{\boldsymbol{\Theta}}(\boldsymbol{\theta})$, which we call the prior.
Given the likelihood and the prior, we can define the posterior. The MAP estimation finds
the peak of the posterior distribution as a way to “explain” the data. Several important 
topics will be covered in Section 8.3. For example, we will discuss the choice of the prior 
via the concept of conjugate prior. We will also discuss how MAP is related to regularized 
regressions such as the ridge and LASSO regressions. 

The third topic is the minimum mean-square estimation (MMSE), outlined in Sec- 
tion 8.4. The MMSE is a Bayesian approach. An important result that will be demonstrated 
is that the MMSE estimate is the conditional expectation of the posterior distribution. In 
other words, it is the mean of the posterior. An MMSE estimate has an important difference 
compared to a MAP estimate, namely that while an MMSE estimate is the mean of the 
posterior, a MAP estimate is the mode of the posterior. We discuss the formulation of the 
estimation problem and ways of solving the problem. We also discuss how the MMSE can
be performed for multidimensional Gaussian distributions.


8.1 Maximum-Likelihood Estimation


Maximum-likelihood (ML) estimation, as the name suggests, is an estimation method that 
“maximizes” the “likelihood”. Therefore, to understand the ML estimation, we first need to 
understand the meaning of likelihood, and why maximizing the likelihood would be useful. 


8.1.1 Likelihood function 


Consider a set of $N$ data points $\mathcal{D} = \{x_1, x_2, \ldots, x_N\}$. We want to describe these data points
using a probability distribution. What would be the most general way of defining such a
distribution?

Since we have $N$ data points, and we do not know anything about them, the most general
way to define a distribution is as a high-dimensional probability density function (PDF)
$f_{\mathbf{X}}(\mathbf{x})$. This is a PDF of a random vector $\mathbf{X} = [X_1, \ldots, X_N]^T$. A particular realization of
this random vector is $\mathbf{x} = [x_1, \ldots, x_N]^T$.

$f_{\mathbf{X}}(\mathbf{x})$ is the most general description for the $N$ data points because $f_{\mathbf{X}}(\mathbf{x})$ is the
joint PDF of all variables. It provides the complete statistical description of the vector $\mathbf{X}$.
For example, we can compute the mean vector $\mathbb{E}[\mathbf{X}]$, the covariance matrix $\text{Cov}(\mathbf{X})$, the
marginal distributions, the conditional distribution, the conditional expectations, etc. In
short, if we know $f_{\mathbf{X}}(\mathbf{x})$, we know everything about $\mathbf{X}$.

The joint PDF $f_{\mathbf{X}}(\mathbf{x})$ is always parameterized by a certain parameter $\boldsymbol{\theta}$. For example, if
we assume that $\mathbf{X}$ is drawn from a joint Gaussian distribution, then $f_{\mathbf{X}}(\mathbf{x})$ is parameterized
by the mean vector $\boldsymbol{\mu}$ and the covariance matrix $\boldsymbol{\Sigma}$. So we say that the parameter $\boldsymbol{\theta}$ is
$\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\Sigma})$. To state the dependency on the parameter explicitly, we write

$$f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) = \text{PDF of the random vector } \mathbf{X} \text{ with a parameter } \boldsymbol{\theta}.$$

When you express the joint PDF as a function of $\mathbf{x}$ and $\boldsymbol{\theta}$, you have two variables to
play with. The first variable is the observation $\mathbf{x}$, which is given by the measured data. We
usually think about the probability density function $f_{\mathbf{X}}(\mathbf{x})$ in terms of $\mathbf{x}$, because the PDF
is evaluated at $\mathbf{X} = \mathbf{x}$. In estimation, however, $\mathbf{x}$ is something that you cannot control.
When your boss hands a dataset to you, $\mathbf{x}$ is already fixed. You can consider the probability
of getting this particular $\mathbf{x}$, but you cannot change $\mathbf{x}$.

The second variable stated in $f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta})$ is the parameter $\boldsymbol{\theta}$. This parameter is what
we want to find out, and it is the subject of interest in an estimation problem. Our goal is
to find the optimal $\boldsymbol{\theta}$ that can offer the “best explanation” to data $\mathbf{x}$, in the sense that it
can maximize $f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta})$.

The likelihood function is the PDF that shifts the emphasis to $\boldsymbol{\theta}$:


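Definition 8.1. Let $\mathbf{X} = [X_1, \ldots, X_N]^T$ be a random vector drawn from a joint PDF
$f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta})$, and let $\mathbf{x} = [x_1, \ldots, x_N]^T$ be the realizations. The likelihood function is a
function of the parameter $\boldsymbol{\theta}$ given the realizations $\mathbf{x}$:

$$\mathcal{L}(\boldsymbol{\theta} \,|\, \mathbf{x}) \overset{\text{def}}{=} f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}). \tag{8.1}$$

A word of caution: $\mathcal{L}(\boldsymbol{\theta} \,|\, \mathbf{x})$ is not a conditional PDF because $\boldsymbol{\theta}$ is not a random variable.
The correct way to interpret $\mathcal{L}(\boldsymbol{\theta} \,|\, \mathbf{x})$ is to view it as a function of $\boldsymbol{\theta}$. This function changes
its shape according to the observed data $\mathbf{x}$. We will return to this point shortly.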
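
To see concretely why the likelihood should not be read as a density over the parameter, the following Python sketch fixes a small, hypothetical Bernoulli data vector, evaluates the likelihood on a grid of parameter values, and integrates it numerically. The data vector and the grid resolution are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical data: N = 5 Bernoulli measurements, three of which are 1.
x = np.array([1, 0, 1, 1, 0])

def likelihood(theta, x):
    """L(theta | x) = prod_n theta^{x_n} (1 - theta)^{1 - x_n} for fixed data x."""
    return np.prod(theta ** x * (1.0 - theta) ** (1 - x))

theta_grid = np.linspace(0.0, 1.0, 10001)
L = np.array([likelihood(t, x) for t in theta_grid])

# Trapezoidal rule for the area under L(theta | x) over theta in [0, 1].
area = np.sum(0.5 * (L[1:] + L[:-1]) * np.diff(theta_grid))

print("area under L(theta | x):", area)   # not equal to 1
print("theta at the peak      :", theta_grid[np.argmax(L)])
```

For this data the area is the Beta integral of $\theta^3(1-\theta)^2$, which equals $1/60$ rather than 1, so the curve is a function of $\theta$ whose height is meaningful but whose area is not normalized.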


Independent observations 


While $f_{\mathbf{X}}(\mathbf{x})$ provides us with a complete picture of $\mathbf{X}$, using $f_{\mathbf{X}}(\mathbf{x})$ is tedious. We need
to describe how each $X_n$ is generated and describe how $X_n$ is related to $X_m$ for all pairs of
$n$ and $m$. If the vector $\mathbf{X}$ contains $N$ entries, then there are on the order of $N^2/2$ pairs of
correlations we need to compute. When $N$ is large, finding $f_{\mathbf{X}}(\mathbf{x})$ would be very difficult if not impossible.

In practice, $f_{\mathbf{X}}(\mathbf{x})$ may sometimes be overkill. For example, if we measure the inter-arrival
time of a bus for several days, it is quite likely that the measurements will not be
correlated. In this case, instead of using the full $f_{\mathbf{X}}(\mathbf{x})$, we can make assumptions about
the data points. The assumption we will make is that all the data points are independent
and that they are drawn from an identical distribution $f_X(x)$. The assumption that the
data points are independently and identically distributed (i.i.d.) significantly simplifies the
problem so that the joint PDF $f_{\mathbf{X}}$ can be written as a product of single PDFs $f_{X_n}$:

$$f_{\mathbf{X}}(\mathbf{x}) = f_{X_1, \ldots, X_N}(x_1, \ldots, x_N) = \prod_{n=1}^{N} f_{X_n}(x_n).$$


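If you prefer a visualization, we can take a look at the covariance matrix, which goes
from a full covariance matrix to a diagonal matrix and then to a scaled identity matrix:

$$\begin{bmatrix} \text{Var}[X_1] & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_N) \\ \text{Cov}(X_2, X_1) & \text{Var}[X_2] & \cdots & \text{Cov}(X_2, X_N) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_N, X_1) & \text{Cov}(X_N, X_2) & \cdots & \text{Var}[X_N] \end{bmatrix} \underset{\text{independent}}{\Longrightarrow} \begin{bmatrix} \text{Var}[X_1] & 0 & \cdots & 0 \\ 0 & \text{Var}[X_2] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \text{Var}[X_N] \end{bmatrix} \underset{\text{identical}}{\Longrightarrow} \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}.$$

The assumption of i.i.d. is strong. Not all data can be modeled as i.i.d. (For example,
photons passing through a scattering medium have correlated statistics.) However, if the
i.i.d. assumption is valid, we can simplify the model significantly.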
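
The covariance structure implied by the i.i.d. assumption can be checked empirically. The sketch below simulates many realizations of a short random vector with i.i.d. Gaussian entries and computes the sample covariance matrix; the dimension, the standard deviation, and the number of draws are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                 # dimension of the random vector X (illustrative)
sigma = 2.0           # common standard deviation (illustrative)
num_draws = 200_000   # number of simulated realizations of X

# Each row is one realization of X = [X_1, ..., X_N]^T with i.i.d. entries.
X = rng.normal(loc=0.0, scale=sigma, size=(num_draws, N))

# Sample covariance matrix; the columns are treated as the variables.
C = np.cov(X, rowvar=False)
print(np.round(C, 2))
```

Up to sampling noise, the off-diagonal entries are close to zero (independence) and the diagonal entries are all close to $\sigma^2 = 4$ (identical distribution).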

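If the data points are i.i.d., then we can write the joint PDF as

$$f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) = \prod_{n=1}^{N} f_{X_n}(x_n; \boldsymbol{\theta}).$$

This simplifies the likelihood function to a product of the individual PDFs.


Definition 8.2. Given i.i.d. random variables $X_1, \ldots, X_N$ that all have the same PDF
$f_{X_n}(x_n)$, the likelihood function is

$$\mathcal{L}(\boldsymbol{\theta} \,|\, \mathbf{x}) \overset{\text{def}}{=} \prod_{n=1}^{N} f_{X_n}(x_n; \boldsymbol{\theta}). \tag{8.2}$$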


In computation we often take the log of the likelihood function. We call the resulting function
the log-likelihood.


Definition 8.3. Given a set of i.i.d. random variables $X_1, \ldots, X_N$ with PDF $f_{X_n}(x_n; \boldsymbol{\theta})$,
the log-likelihood is defined as

$$\log \mathcal{L}(\boldsymbol{\theta} \,|\, \mathbf{x}) = \log f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) = \sum_{n=1}^{N} \log f_{X_n}(x_n; \boldsymbol{\theta}). \tag{8.3}$$
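
One practical reason for working with the log-likelihood is numerical: a product of many small PDF values underflows in floating-point arithmetic, whereas the sum of their logarithms does not. The following sketch illustrates this with simulated standard Gaussian data; the sample size and the evaluation at $\theta = 0$ are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
x = rng.normal(loc=0.0, scale=1.0, size=N)   # simulated i.i.d. Gaussian(0, 1) data

# Individual PDF values f_{X_n}(x_n; theta) with theta = 0 and unit variance.
pdf_values = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

likelihood = np.prod(pdf_values)             # product of N numbers, each below 0.4
log_likelihood = np.sum(np.log(pdf_values))  # sum of N moderate numbers

print("likelihood     :", likelihood)
print("log-likelihood :", log_likelihood)
```

Because every factor is at most $1/\sqrt{2\pi} \approx 0.4$, the product of 2000 factors is far below the smallest positive double-precision number and is reported as zero, while the log-likelihood remains a finite negative number. Since the logarithm is monotonically increasing, maximizing either quantity gives the same maximizer.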


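Example 8.3. Find the log-likelihood of a sequence of i.i.d. Gaussian random variables
$X_1, \ldots, X_N$ with mean $\mu$ and variance $\sigma^2$.


Solution. Since the random variables $X_1, \ldots, X_N$ are i.i.d. Gaussian, the PDF is

$$f_{\mathbf{X}}(\mathbf{x}; \mu, \sigma^2) = \prod_{n=1}^{N} \left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x_n - \mu)^2}{2\sigma^2} \right\} \right\}. \tag{8.4}$$

Taking the log on both sides yields the log-likelihood function:

$$\begin{aligned} \log \mathcal{L}(\mu, \sigma^2 \,|\, \mathbf{x}) &= \log f_{\mathbf{X}}(\mathbf{x}; \mu, \sigma^2) \\ &= \sum_{n=1}^{N} \log \left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x_n - \mu)^2}{2\sigma^2} \right\} \right\} \\ &= \sum_{n=1}^{N} \left\{ -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_n - \mu)^2}{2\sigma^2} \right\} \\ &= -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} (x_n - \mu)^2. \end{aligned}$$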
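
The Gaussian log-likelihood can be checked numerically by comparing the closed-form expression $-\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n}(x_n-\mu)^2$ with a direct sum of the logs of the individual PDFs. The function name `gaussian_log_likelihood` and the toy data below are hypothetical choices for this sketch.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma2):
    """Closed-form log-likelihood of i.i.d. Gaussian(mu, sigma2) samples."""
    N = len(x)
    return -0.5 * N * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

x = np.array([1.0, 2.0, 3.0])   # toy data, chosen only for illustration
mu, sigma2 = 2.0, 1.0

closed_form = gaussian_log_likelihood(x, mu, sigma2)

# Term-by-term version: sum of the logs of the individual Gaussian PDFs.
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
termwise = np.sum(np.log(pdf))

print("closed form :", closed_form)
print("term by term:", termwise)
```

The two numbers agree up to floating-point rounding, which is a quick way to catch sign or factor-of-two mistakes when deriving a log-likelihood by hand.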


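Practice Exercise 8.1. Find the log-likelihood of a sequence of i.i.d. Bernoulli random
variables $X_1, \ldots, X_N$ with parameter $\theta$.


Solution. If $X_1, \ldots, X_N$ are i.i.d. Bernoulli random variables, we have

$$f_{\mathbf{X}}(\mathbf{x}; \theta) = \prod_{n=1}^{N} \left\{ \theta^{x_n}(1-\theta)^{1-x_n} \right\}.$$

Taking the log on both sides of the equation yields the log-likelihood function:

$$\begin{aligned} \log \mathcal{L}(\theta \,|\, \mathbf{x}) &= \log \prod_{n=1}^{N} \left\{ \theta^{x_n}(1-\theta)^{1-x_n} \right\} \\ &= \sum_{n=1}^{N} \log \left\{ \theta^{x_n}(1-\theta)^{1-x_n} \right\} \\ &= \sum_{n=1}^{N} \left\{ x_n \log\theta + (1 - x_n)\log(1-\theta) \right\} \\ &= \left( \sum_{n=1}^{N} x_n \right) \cdot \log\theta + \left( N - \sum_{n=1}^{N} x_n \right) \cdot \log(1-\theta). \end{aligned}$$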


Visualizing the likelihood function 


The likelihood function $\mathcal{L}(\theta \,|\, \mathbf{x})$ is a function of $\theta$, but its value also depends on the underlying
measurements $\mathbf{x}$. It is extremely important to keep in mind the presence of both.

To help you visualize the effect of $\theta$ and $\mathbf{x}$, we consider a set of i.i.d. Bernoulli random
variables. As we have just shown in the practice exercise, the log-likelihood function of these
i.i.d. random variables is

$$\log \mathcal{L}(\theta \,|\, \mathbf{x}) = \underbrace{\left( \sum_{n=1}^{N} x_n \right)}_{S} \cdot \log\theta + \underbrace{\left( N - \sum_{n=1}^{N} x_n \right)}_{N-S} \cdot \log(1-\theta), \tag{8.5}$$

where we define $S = \sum_{n=1}^{N} x_n$ as the sum of the (binary) measurements.
To make the dependency on $S$ and $\theta$ explicit, we write $\log \mathcal{L}(\theta \,|\, \mathbf{x})$ as

$$\log \mathcal{L}(\theta \,|\, S) = S \log\theta + (N - S)\log(1-\theta), \tag{8.6}$$

which emphasizes the role of $S$ in defining the log-likelihood function. We plot the surface
of $\log \mathcal{L}(\theta \,|\, S)$ as a function of $S$ and $\theta$, assuming that $N = 50$. As shown on the left-hand side
of Figure 8.3, the surface $\log \mathcal{L}(\theta \,|\, S)$ has a saddle shape. Along one direction the function goes
up, whereas along another direction the function goes down. In the middle of Figure 8.3,
we show a bird’s-eye view of the surface, with the color-coding matched with the surface
plot. As you can see, when plotted as a function of $\theta$ and $\mathbf{x}$ (in our case, we use a summary
statistic $S = \sum_{n=1}^{N} x_n$), the two-dimensional plot tells us how the log-likelihood function
changes when $S$ changes. On the right-hand side of Figure 8.3, we show two particular
cross sections of the two-dimensional plot. One cross section is taken from $S = 25$ and the
other cross section is taken from $S = 12$. Since the total number of measurements in this numerical
experiment is assumed to be $N = 50$, the first cross section at $S = 25$ is obtained when
half of the Bernoulli measurements are “1”, whereas the second cross section at $S = 12$ is
obtained when about a quarter of the Bernoulli measurements are “1”.

The cross sections tell us that the log-likelihood function $\log \mathcal{L}(\theta \,|\, S)$ is a function defined
specifically for a given measurement $\mathbf{x}$. As you can see from Figure 8.3, the log-likelihood
function changes when $S$ changes. Therefore, if our goal is to “find a $\theta$ that maximizes the
log-likelihood function”, then for a different $\mathbf{x}$ we will have a different answer. For example,
according to Figure 8.3, the maximum for $\log \mathcal{L}(\theta \,|\, S = 25)$ occurs when $\theta = 0.5$, and the
maximum for $\log \mathcal{L}(\theta \,|\, S = 12)$ occurs when $\theta = 0.24$. These are the maximum-likelihood
estimates for the respective measurements.


Figure 8.3: We plot the log-likelihood function as a function of $S = \sum_{n=1}^{N} x_n$ and $\theta$. [Left] We show
the surface plot of $\log \mathcal{L}(\theta \,|\, S) = S\log\theta + (N - S)\log(1-\theta)$. Note that the surface has a saddle shape.
[Middle] By taking a bird’s-eye view of the surface plot, we obtain a 2-dimensional contour plot of the
surface, where the color code matches the height of the log-likelihood function. [Right] We take two
cross sections along $S = 25$ and $S = 12$. Observe how the shape changes.


We use the following MATLAB code to generate the surface plot:


% MATLAB code to generate the surface plot
N = 50;                                      % number of measurements
S = 1:N;                                     % possible values of the sum S
theta = linspace(0.1,0.9,100);               % grid of candidate parameters
[S_grid, theta_grid] = meshgrid(S, theta);
L = S_grid.*log(theta_grid) + (N-S_grid).*log(1-theta_grid);   % log-likelihood
s = surf(S,theta,L);
s.LineStyle = '-';
colormap jet
view(65,15)


For the bird’s-eye view plot, we replace surf with imagesc(S,theta,L). For the cross
section plots, we call the commands plot(theta, L(:,12)) and plot(theta, L(:,25)).
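
For readers working in Python, the same log-likelihood grid can be computed with NumPy. The sketch below is a plotting-free port under the same settings ($N = 50$, a 100-point grid on $[0.1, 0.9]$); instead of drawing the cross sections it reports where each one peaks on the grid. The variable names `theta_S25` and `theta_S12` are introduced here for illustration.

```python
import numpy as np

N = 50
S = np.arange(1, N + 1)                   # possible values of the sum S
theta = np.linspace(0.1, 0.9, 100)        # grid of candidate parameters
S_grid, theta_grid = np.meshgrid(S, theta)
L = S_grid * np.log(theta_grid) + (N - S_grid) * np.log(1 - theta_grid)

# Column S - 1 of L holds the cross section log L(theta | S),
# because Python indexes from 0 while S starts at 1.
theta_S25 = theta[np.argmax(L[:, 25 - 1])]
theta_S12 = theta[np.argmax(L[:, 12 - 1])]

print("grid maximizer for S = 25:", theta_S25)
print("grid maximizer for S = 12:", theta_S12)
```

The grid maximizers land within one grid step of $0.5$ and $0.24$, respectively. With a plotting library such as matplotlib, the rows and columns of `L` can be passed to a surface or image plot to reproduce the three panels.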


8.1.2. Maximum-likelihood estimate 


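The likelihood is the PDF of $\mathbf{X}$ but viewed as a function of $\boldsymbol{\theta}$. The optimization problem
of maximizing $\mathcal{L}(\boldsymbol{\theta} \,|\, \mathbf{x})$ is called the maximum-likelihood (ML) estimation:


Definition 8.4. Let $\mathcal{L}(\boldsymbol{\theta})$ be the likelihood function of the parameter $\boldsymbol{\theta}$ given the
measurements $\mathbf{x} = [x_1, \ldots, x_N]^T$. The maximum-likelihood estimate of the parameter
$\boldsymbol{\theta}$ is a parameter that maximizes the likelihood:

$$\widehat{\boldsymbol{\theta}}_{\text{ML}} \overset{\text{def}}{=} \underset{\boldsymbol{\theta}}{\text{argmax}} \;\; \mathcal{L}(\boldsymbol{\theta} \,|\, \mathbf{x}). \tag{8.7}$$


Example 8.4. Find the ML estimate for a set of i.i.d. Bernoulli random variables
$\{X_1, \ldots, X_N\}$ with $X_n \sim \text{Bernoulli}(\theta)$ for $n = 1, \ldots, N$.


Solution. We know that the log-likelihood function of a set of i.i.d. Bernoulli random
variables is given by

$$\log \mathcal{L}(\theta \,|\, \mathbf{x}) = \left( \sum_{n=1}^{N} x_n \right) \cdot \log\theta + \left( N - \sum_{n=1}^{N} x_n \right) \cdot \log(1-\theta). \tag{8.8}$$

Thus, to find the ML estimate, we need to solve the optimization problem

$$\widehat{\theta}_{\text{ML}} = \underset{\theta}{\text{argmax}} \left\{ \left( \sum_{n=1}^{N} x_n \right) \cdot \log\theta + \left( N - \sum_{n=1}^{N} x_n \right) \cdot \log(1-\theta) \right\}.$$

Taking the derivative with respect to $\theta$ and setting it to zero, we obtain

$$\frac{d}{d\theta} \left\{ \left( \sum_{n=1}^{N} x_n \right) \cdot \log\theta + \left( N - \sum_{n=1}^{N} x_n \right) \cdot \log(1-\theta) \right\} = 0.$$

This gives us

$$\frac{\sum_{n=1}^{N} x_n}{\theta} - \frac{N - \sum_{n=1}^{N} x_n}{1-\theta} = 0 \quad \Longrightarrow \quad \widehat{\theta}_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$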
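
As a numerical check that the sample average is a stationary point of the Bernoulli log-likelihood, the following sketch simulates coin flips and evaluates the derivative of the log-likelihood (sometimes called the score) at and around the estimate. The true parameter, the sample size, and the function name `score` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
theta_true = 0.3                              # assumed only to simulate data
x = (rng.random(N) < theta_true).astype(int)  # i.i.d. Bernoulli(theta_true)

S = x.sum()
theta_ml = S / N                              # sample average of the measurements

def score(theta):
    """Derivative of the Bernoulli log-likelihood with respect to theta."""
    return S / theta - (N - S) / (1 - theta)

print("S =", S, "  theta_ML =", theta_ml)
print("score at theta_ML      :", score(theta_ml))
print("score just below, above:", score(theta_ml - 0.05), score(theta_ml + 0.05))
```

The derivative vanishes at the sample average (up to rounding), is positive just below it, and is negative just above it, which is what one expects at a maximum of a concave function.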


Let’s do a sanity check to see if this result makes sense. The solution to this problem
says that $\widehat{\theta}_{\text{ML}}$ is the empirical average of the measurements. Assume that $N = 50$. Let us
consider two particular scenarios as illustrated in Figure 8.4.

• Scenario 1: $\mathbf{x}$ is a vector of measurements such that $S \overset{\text{def}}{=} \sum_{n=1}^{N} x_n = 25$. Since
$N = 50$, the formula tells us that $\widehat{\theta}_{\text{ML}} = \frac{25}{50} = 0.5$. This is the best guess based on the
50 measurements where 25 are heads. If you look at Figure 8.3 and Figure 8.4, when
$S = 25$, we are looking at a particular cross section in the 2D plot. The likelihood
function we are inspecting is $\mathcal{L}(\theta \,|\, S = 25)$. For this likelihood function, the maximum
occurs at $\theta = 0.5$.

• Scenario 2: $\mathbf{x}$ is a vector of measurements such that $S = \sum_{n=1}^{N} x_n = 12$. The formula
tells us that $\widehat{\theta}_{\text{ML}} = \frac{12}{50} = 0.24$. This is again the best guess based on the 50 measurements
where 12 are heads. Referring to Figure 8.3 and Figure 8.4, we can see that
the likelihood function corresponds to another cross section $\mathcal{L}(\theta \,|\, S = 12)$ where the
maximum occurs at $\theta = 0.24$.


Figure 8.4: Illustration of how the maximum-likelihood estimate of a set of i.i.d. Bernoulli random
variables is determined. The subfigures show two particular scenarios at $S = 25$ and $S = 12$,
assuming that $N = 50$. When $S = 25$, the likelihood function has an approximately quadratic shape centered at
$\theta = 0.5$. This point is also the peak of the likelihood function when $S = 25$. Therefore, the ML estimate
is $\widehat{\theta}_{\text{ML}} = 0.5$. The second case is when $S = 12$. The likelihood is shifted toward the left. The
ML estimate is $\widehat{\theta}_{\text{ML}} = 0.24$.


At this point, you may wonder why the shape of the likelihood function $\mathcal{L}(\theta \,|\, \mathbf{x})$ changes
so radically as $\mathbf{x}$ changes. The answer can be found in Figure 8.5. Imagine that we have
$N = 50$ measurements of which $S = 40$ give us heads. If these i.i.d. Bernoulli random
variables have a parameter $\theta = 0.5$, it is quite unlikely that we will get 40 out of 50
measurements to be heads. (If it were $\theta = 0.5$, we should get more or less 25 out of 50
heads.) When $S = 40$, and without any additional information about the experiment, the
most logical guess is that the Bernoulli random variables have a parameter $\theta = 0.8$. Since
the measurement $S$ can be as extreme as 0 out of 50 or 50 out of 50, the likelihood function
$\mathcal{L}(\theta \,|\, \mathbf{x})$ has to reflect these extreme cases. Therefore, as we change $\mathbf{x}$, we observe a big
change in the shape of the likelihood function.

As you can see from Figure 8.5, $S = 40$ corresponds to the marked vertical cross
section. As we determine the maximum-likelihood estimate, we search among all the possibilities,
such as $\theta = 0.2$, $\theta = 0.5$, $\theta = 0.8$, etc. These possibilities correspond to the horizontal
lines we drew in the figure. Among those horizontal lines, it is clear that the best estimate
occurs when $\theta = 0.8$, which is also the ML estimate.


Figure 8.5: Suppose that we have a set of measurements such that $S = 40$. To determine the ML
estimate, we look at the vertical cross section at $S = 40$. Among the different candidate parameters,
e.g., $\theta = 0.2$, $\theta = 0.5$ and $\theta = 0.8$, we pick the one that has the maximum response to the likelihood
function. For $S = 40$, it is more likely that the underlying parameter is $\theta = 0.8$ than $\theta = 0.2$ or $\theta = 0.5$.


Visualizing ML estimation as N grows


Maximum-likelihood estimation can also be understood directly from the PDF instead of
the likelihood function. To explain this perspective, let’s do a quick exercise.


Practice Exercise 8.2. Suppose that $X_n$ is a Gaussian random variable. Assume
that $\sigma = 1$ is known but the mean $\theta$ is unknown. Find the ML estimate of the mean.


Solution. The ML estimate $\widehat{\theta}_{\text{ML}}$ is

$$\widehat{\theta}_{\text{ML}} = \underset{\theta}{\text{argmax}} \;\; \log \mathcal{L}(\theta \,|\, \mathbf{x}).$$

With some calculation, we can show that

$$\begin{aligned} \widehat{\theta}_{\text{ML}} &= \underset{\theta}{\text{argmax}} \;\; \log \prod_{n=1}^{N} \left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x_n - \theta)^2}{2\sigma^2} \right\} \right\} \\ &= \underset{\theta}{\text{argmax}} \;\; -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} (x_n - \theta)^2. \end{aligned}$$

Taking the derivative with respect to $\theta$, we obtain

$$\frac{d}{d\theta} \left\{ -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} (x_n - \theta)^2 \right\} = 0.$$

This gives us $\frac{1}{\sigma^2}\sum_{n=1}^{N} (x_n - \theta) = 0$. Therefore, the ML estimate is

$$\widehat{\theta}_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$

Now we will draw the PDF and compare it with the measured data points. Our focus 
is to analyze how the ML estimate changes as N grows. 

When $N = 1$. There is only one observation $x_1$. The best Gaussian that fits this sample
must be the one that is centered at $x_1$. In fact, the optimization is¹

$$\widehat{\theta}_{\text{ML}} = \underset{\theta}{\text{argmax}} \;\; \log \left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x_1 - \theta)^2}{2\sigma^2} \right\} \right\} = \underset{\theta}{\text{argmax}} \;\; -(x_1 - \theta)^2 = x_1.$$

Therefore, the ML estimate is $\widehat{\theta}_{\text{ML}} = x_1$. Figure 8.6 illustrates this case. As we conduct the
ML estimation, we imagine that there are a few candidate PDFs. The ML estimation says
that among all these candidate PDFs we need to find one that can maximize the probability
of obtaining the observation $x_1$. Since we only have one observation, we have no choice but
to pick a Gaussian centered at $x_1$. Certainly the sample $X_1 = x_1$ could be bad, and we may
find a wrong Gaussian. However, with only one sample there is no way for us to make better
decisions.

¹We skip the step of checking whether the stationary point is a maximum or a minimum, which can be
done by evaluating the second-order derivative. In fact, since the function $-(x_1 - \theta)^2$ is concave in $\theta$, a
stationary point must be a maximum.


Figure 8.6: $N = 1$. Suppose that we are given one observed data point located around $x = -2.1$. To
conduct the ML estimation we propose a few candidate PDFs, each being a Gaussian with unit variance
but a different mean $\theta$. The ML estimate is a parameter $\theta$ such that the corresponding PDF matches
the best with the observed data. In this example the best match happens when the estimated Gaussian
PDF is centered at $x_1$.


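When $N = 2$. In this case we need to find a Gaussian that fits both $x_1$ and $x_2$. The
probability of simultaneously observing $x_1$ and $x_2$ is determined by the joint distribution.
By independence we then have

$$\begin{aligned} \widehat{\theta}_{\text{ML}} &= \underset{\theta}{\text{argmax}} \;\; \log \left\{ \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{2} \exp\left\{ -\frac{(x_1 - \theta)^2 + (x_2 - \theta)^2}{2\sigma^2} \right\} \right\} \\ &= \underset{\theta}{\text{argmax}} \;\; -\frac{(x_1 - \theta)^2 + (x_2 - \theta)^2}{2\sigma^2} = \frac{x_1 + x_2}{2}, \end{aligned}$$

where the last step is obtained by taking the derivative:

$$\frac{d}{d\theta} \left\{ (x_1 - \theta)^2 + (x_2 - \theta)^2 \right\} = -2(x_1 - \theta) - 2(x_2 - \theta).$$

Equating this with zero yields the solution $\theta = \frac{x_1 + x_2}{2}$. Therefore, the best Gaussian that
fits the observations is $\text{Gaussian}\left(\frac{x_1 + x_2}{2}, \sigma^2\right)$.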

Does this result make sense? When you have two data points $x_1$ and $x_2$, the ML
estimation is trying to find a Gaussian that can best fit both of these two data points.
Your best bet here is $\widehat{\theta}_{\text{ML}} = (x_1 + x_2)/2$, because there are no other choices. If you choose
$\widehat{\theta}_{\text{ML}} = x_1$ or $\widehat{\theta}_{\text{ML}} = x_2$, it cannot be a good estimate because you are not using both data
points. As shown in Figure 8.7, for these two observed data points $x_1$ and $x_2$, the PDF
marked in red (which is a Gaussian centered at $(x_1 + x_2)/2$) is indeed the best fit.


Figure 8.7: $N = 2$. Suppose that we are given two observed data points located around $x_1 = -0.98$
and $x_2 = -1.15$. To conduct the ML estimation we propose a few candidate PDFs, each being a
Gaussian with unit variance but a different mean $\theta$. The ML estimate is a parameter $\theta$ such that the
corresponding PDF best matches the observed data. In this example the best match happens when the
estimated Gaussian PDF is centered at $(x_1 + x_2)/2 \approx -1.07$.


When $N = 10$ and $N = 100$. We can continue the above calculation for $N = 10$ and
$N = 100$. In this case the MLE is

$$\widehat{\theta}_{\text{ML}} = \underset{\theta}{\text{argmax}} \;\; -\sum_{n=1}^{N} \frac{(x_n - \theta)^2}{2\sigma^2} = \frac{1}{N}\sum_{n=1}^{N} x_n,$$

where the optimization is solved by taking the derivative:

$$\frac{d}{d\theta} \sum_{n=1}^{N} (x_n - \theta)^2 = -2\sum_{n=1}^{N} (x_n - \theta).$$

Equating this with zero yields the solution $\theta = \frac{1}{N}\sum_{n=1}^{N} x_n$.

The result suggests that for an arbitrary number of training samples the ML estimate
is the sample average. These cases are illustrated in Figure 8.8. As you can see, the red
curves (the estimated PDF) are always trying to fit as many data points as possible.
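
The progression from $N = 1$ to $N = 100$ can be reproduced numerically. The sketch below draws one stream of unit-variance Gaussian samples and computes the ML estimate from the first $N$ of them; the true mean and the random seed are assumptions made only to simulate data, and the figures in the text use their own samples.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true = -1.0                    # assumed only to simulate data
x = rng.normal(loc=theta_true, scale=1.0, size=100)

# ML estimate of the mean using only the first N samples (the sample average).
estimates = {N: x[:N].mean() for N in (1, 2, 10, 100)}

for N, theta_ml in estimates.items():
    error = abs(theta_ml - theta_true)
    print(f"N = {N:3d}   theta_ML = {theta_ml:+.4f}   |error| = {error:.4f}")
```

With $N = 1$ the estimate is simply the first sample, and with $N = 2$ it is the midpoint of the first two. The error of a single run need not shrink monotonically, but the standard deviation of the sample average is $\sigma/\sqrt{N}$, so larger $N$ gives a more concentrated estimate on average.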

The above experiment tells us something about the ML estimation: 


How does ML estimation work, intuitively? 


• The likelihood function $\mathcal{L}(\theta \,|\, \mathbf{x})$ measures how “likely” it is that we will get $\mathbf{x}$ if
the underlying parameter is $\theta$.

• In the case of a Gaussian with an unknown mean, you move around the Gaussian
until you find a good fit.


Figure 8.8: When $N = 10$ and $N = 100$, the ML estimation continues to evaluate the different
candidate PDFs. For a given set of data points, the ML estimation picks the best PDF to fit the data
points. In this Gaussian example it was shown that the optimal parameter is $\widehat{\theta}_{\text{ML}} = (1/N)\sum_{n=1}^{N} x_n$,
which is the sample average.


8.1.3. Application 1: Social network analysis 


ML estimation has extremely broad applicability. In this subsection and the next we discuss 
two real examples. We start with an example in social network analysis. 

In Chapter 3, when we discussed the Bernoulli random variables, we introduced the
Erdős-Rényi graph, one of the simplest models for social networks. The Erdős-Rényi graph
is a single-membership network that assumes that all users belong to the same cluster. Thus
the connectivity between users is specified by a single parameter, which is also the probability
of the Bernoulli random variable.

In our discussions in Chapter 3 we defined an adjacency matrix to represent a graph.
The adjacency matrix is a binary matrix, with the $(i, j)$th entry indicating an edge connecting
nodes $i$ and $j$. Since the presence and absence of an edge is binary and random, we may
model each element of the adjacency matrix as a Bernoulli random variable

$$X_{ij} \sim \text{Bernoulli}(p).$$

In other words, the edge $X_{ij}$ linking user $i$ and user $j$ in the network is either $X_{ij} = 1$ with
probability $p$, or $X_{ij} = 0$ with probability $1 - p$. In terms of notation, we define the matrix
$\mathbf{X} \in \mathbb{R}^{N \times N}$ as the adjacency matrix, with the $(i, j)$th element being $X_{ij}$.

A few examples of a single-membership Erdős-Rényi graph are shown in Figure 8.9. As
the figure shows, the network connectivity increases as the Bernoulli parameter $p$ increases.
This happens because $p$ defines the “density” of the edges. If $p$ is large, we have a greater
chance of getting $X_{ij} = 1$, and so there is a higher probability that an edge is present
between node $i$ and node $j$. If $p$ is small, the probability is lower.

Suppose that we are given one snapshot of the network, i.e., one realization $\mathbf{x} \in \mathbb{R}^{N \times N}$
of the adjacency matrix $\mathbf{X} \in \mathbb{R}^{N \times N}$. The problem of recovering the latent parameter $p$ can
be formulated as an ML estimation.


(b) Adjacency matrices of the corresponding graphs.
Figure 8.9: A single-membership Erdős-Rényi graph is a graph structure in which the edge between
node $i$ and node $j$ is defined as a Bernoulli random variable with parameter $p$. As $p$ increases, the graph
has a higher probability of having more edges. The adjacency matrices shown in the bottom row are the
mathematical representations of the graphs.


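Example 8.5. Write down the log-likelihood function of the single-membership Erdős-Rényi
graph ML estimation problem.


Solution. Based on the definition of the graph model, we know that

$$X_{ij} \sim \text{Bernoulli}(p).$$

Therefore, the probability mass function of $X_{ij}$ is

$$\mathbb{P}[X_{ij} = 1] = p, \qquad \mathbb{P}[X_{ij} = 0] = 1 - p.$$

This can be compactly expressed as

$$f_{\mathbf{X}}(\mathbf{x}; p) = \prod_{i=1}^{N}\prod_{j=1}^{N} p^{x_{ij}}(1-p)^{1-x_{ij}}.$$

Hence, the log-likelihood is

$$\log \mathcal{L}(p \,|\, \mathbf{x}) = \sum_{i=1}^{N}\sum_{j=1}^{N} \left\{ x_{ij}\log p + (1 - x_{ij})\log(1-p) \right\}.$$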


Now that we have the log-likelihood function, we can proceed to estimate the param- 
eter p. The solution to this is the ML estimate. 
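Practice Exercise 8.3. Solve the ML estimation problem: 

$$\widehat{p}_{\text{ML}} = \underset{p}{\text{argmax}} \;\; \log \mathcal{L}(p \,|\, \boldsymbol{x}).$$


Solution. Using the log-likelihood we just derived, we have that 

$$\widehat{p}_{\text{ML}} = \underset{p}{\text{argmax}} \; \sum_{i=1}^{N} \sum_{j=1}^{N} \left\{ x_{ij} \log p + (1 - x_{ij}) \log(1 - p) \right\}.$$

Taking the derivative and setting it to zero, 

$$\frac{d}{dp} \log \mathcal{L}(p \,|\, \boldsymbol{x}) = \frac{d}{dp} \sum_{i=1}^{N} \sum_{j=1}^{N} \left\{ x_{ij} \log p + (1 - x_{ij}) \log(1 - p) \right\} = \sum_{i=1}^{N} \sum_{j=1}^{N} \left\{ \frac{x_{ij}}{p} - \frac{1 - x_{ij}}{1 - p} \right\} = 0.$$

Let $S = \sum_{i=1}^{N} \sum_{j=1}^{N} x_{ij}$. The equation above then becomes 

$$\frac{S}{p} - \frac{N^2 - S}{1 - p} = 0.$$

Rearranging the terms yields $(1 - p)S = p(N^2 - S)$, which gives us 

$$\widehat{p}_{\text{ML}} = \frac{S}{N^2} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} x_{ij}.$$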
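The closed form $S/N^2$ can be checked numerically. The following is a minimal sketch, assuming an $N \times N$ matrix of i.i.d. Bernoulli($p$) entries as in the likelihood above; the function name, the seed, and the grid resolution are illustrative choices and not part of the text. It compares the closed form with a brute-force search over the log-likelihood.

```python
import math
import random

def log_likelihood(p, S, N):
    """Log-likelihood of N*N i.i.d. Bernoulli(p) entries whose sum is S."""
    return S * math.log(p) + (N * N - S) * math.log(1 - p)

random.seed(0)                      # assumed seed, for reproducibility only
N, p_true = 40, 0.3                 # illustrative values
x = [[1 if random.random() < p_true else 0 for _ in range(N)] for _ in range(N)]

S = sum(sum(row) for row in x)      # total number of ones in the matrix
p_closed = S / N**2                 # closed-form ML estimate

# Brute-force check: maximize the log-likelihood over a fine grid in (0, 1).
grid = [k / 10000 for k in range(1, 10000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, S, N))

print(p_closed, p_grid)
```

The two numbers should agree up to the grid spacing, which is what the derivation predicts since the log-likelihood is concave in $p$.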


On computers, visualizing the graphs and computing the ML estimates are reasonably 
straightforward. In MATLAB, you can call the command graph to build a graph from the 
adjacency matrix A. This will allow you to plot the graph. The computation, however, is done 
directly on the adjacency matrix. In the code below, you can see that we call rand to generate 
the Bernoulli random variables. The command triu(A,1) extracts the strictly upper triangular part 
of the matrix A. This ensures that we do not pick the diagonals. The symmetrization 
A+A' ensures that the graph is undirected, meaning that $i$ to $j$ is the same as $j$ to $i$. 


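% MATLAB code to visualize a graph
n = 40;               % Number of nodes
p = 0.3;              % probability
A = rand(n,n)<p;      % i.i.d. Bernoulli(p) entries
A = triu(A,1);        % keep the strictly upper triangular part
A = A+A';             % Adj matrix (symmetric)
G = graph(A);         % Graph
plot(G);              % Drawing
p_ML = mean(A(:));    % ML estimate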
In Python, the computation is done similarly with the help of the networkx library. 
The number of edges $m$ is defined as $m = p \frac{n^2}{2}$. This is because for a graph with $n$ nodes, there 
are at most $\frac{n^2}{2}$ unique pairs of undirected edges. Multiplying this number by the probability 
$p$ will give us the number of edges $m$. 


# Python code to visualize a graph
import networkx as nx
import numpy as np
n = 40                              # Number of nodes
p = 0.3                             # probability
m = int(np.round(((n ** 2)/2)*p))   # Number of edges (integer)
G = nx.gnm_random_graph(n,m)        # Graph
A = nx.to_numpy_array(G)            # Adj matrix (dense array)
nx.draw(G)                          # Drawing
p_ML = np.mean(A)                   # ML estimate


As you can see in both the MATLAB and the Python code, the ML estimate $\widehat{p}_{\text{ML}}$ is determined 
by taking the sample average. Thus the ML estimate, according to our calculation, 
is $\widehat{p}_{\text{ML}} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} x_{ij}$. 


8.1.4 Application 2: Reconstructing images 


Being able to see in the dark is the holy grail of imaging. Many advanced sensing technologies 
have been developed over the past decade. In this example, we consider a single-photon image 
sensor. This is a counting device that counts the number of photons arriving at the sensor. 
Physicists have shown that a Poisson process can model the arrival of the photons. For 
simplicity we assume a homogeneous pattern of N pixels. The underlying intensity of the 
homogeneous pattern is a constant $\lambda$. 

Suppose that we have a sensor with $N$ pixels $X_1, \ldots, X_N$. According to the Poisson 
statistics, the probability of observing a pixel value is determined by the Poisson probability: 

$$X_n \sim \text{Poisson}(\lambda), \qquad n = 1, \ldots, N,$$

or more explicitly, 

$$\mathbb{P}[X_n = x_n] = \frac{\lambda^{x_n}}{x_n!} e^{-\lambda},$$

where $x_n$ is the $n$th observed pixel value, and is an integer. 

A single-photon image sensor is slightly more complicated in the sense that it does 
not report $X_n$ but instead reports a truncated version of $X_n$. Depending on the number of 
incoming photons, the sensor reports 

$$Y_n = \begin{cases} 1, & X_n \ge 1, \\ 0, & X_n = 0. \end{cases} \tag{8.10}$$

We call this type of sensor a one-bit single-photon image sensor (see Figure 8.10). Our 
question is: If we are given the measurements $Y_1, \ldots, Y_N$, can we estimate the underlying 
parameter $\lambda$? 

Figure 8.10: A one-bit single-photon image sensor captures an image with binary bits: It reports a “1” 
when the number of photons exceeds a certain threshold, and “0” otherwise. The recovery problem here 
is to estimate the underlying image from the measurements. 


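Example 8.6. Derive the log-likelihood function of the estimation problem for the 
single-photon image sensors. 


Solution. Since $Y_n$ is a binary random variable, its probability is completely specified 
by the two states it takes: 

$$\mathbb{P}[Y_n = 0] = \mathbb{P}[X_n = 0] = e^{-\lambda},$$
$$\mathbb{P}[Y_n = 1] = \mathbb{P}[X_n \ge 1] = 1 - e^{-\lambda}.$$

Thus, $Y_n$ is a Bernoulli random variable with probability $1 - e^{-\lambda}$ of getting a value 
of 1, and probability $e^{-\lambda}$ of getting a value of 0. By defining $y_n$ as a binary number 
taking values of either 0 or 1, it follows that the log-likelihood is 

$$\log \mathcal{L}(\lambda \,|\, \boldsymbol{y}) = \log \left\{ \prod_{n=1}^{N} \left(1 - e^{-\lambda}\right)^{y_n} \left(e^{-\lambda}\right)^{1 - y_n} \right\}$$
$$= \sum_{n=1}^{N} \left\{ y_n \log\left(1 - e^{-\lambda}\right) - \lambda (1 - y_n) \right\}.$$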


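Practice Exercise 8.4. Solve the ML estimation problem 

$$\widehat{\lambda}_{\text{ML}} = \underset{\lambda}{\text{argmax}} \;\; \log \mathcal{L}(\lambda \,|\, \boldsymbol{y}). \tag{8.11}$$


Solution. First, we define $S = \sum_{n=1}^{N} y_n$. This simplifies the log-likelihood function to 

$$\log \mathcal{L}(\lambda \,|\, \boldsymbol{y}) = \sum_{n=1}^{N} \left\{ y_n \log\left(1 - e^{-\lambda}\right) - \lambda (1 - y_n) \right\}$$
$$= S \log\left(1 - e^{-\lambda}\right) - \lambda (N - S).$$

The ML estimation is 

$$\widehat{\lambda}_{\text{ML}} = \underset{\lambda}{\text{argmax}} \;\; S \log\left(1 - e^{-\lambda}\right) - \lambda (N - S).$$

Taking the derivative w.r.t. $\lambda$ yields 

$$\frac{d}{d\lambda} \left\{ S \log\left(1 - e^{-\lambda}\right) - \lambda (N - S) \right\} = \frac{S e^{-\lambda}}{1 - e^{-\lambda}} - (N - S).$$

Setting the derivative to zero and moving around the terms, it follows that 

$$\frac{S e^{-\lambda}}{1 - e^{-\lambda}} - (N - S) = 0 \quad \Longrightarrow \quad e^{-\lambda} = 1 - \frac{S}{N}.$$

Therefore, the ML estimate is 

$$\widehat{\lambda}_{\text{ML}} = -\log\left(1 - \frac{1}{N} \sum_{n=1}^{N} y_n\right).$$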
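To see the estimate $-\log(1 - S/N)$ at work, here is a small sketch under stated assumptions: the true intensity, the number of pixels, and the seed are made-up values, and the binary measurements are drawn directly as Bernoulli variables with success probability $1 - e^{-\lambda}$ (which is the distribution of $Y_n$ derived in Example 8.6). The helper `dlogL` is a hypothetical name for the derivative of the log-likelihood.

```python
import math
import random

random.seed(1)                              # assumed seed
lam_true, N = 0.8, 20000                    # illustrative values
q = 1 - math.exp(-lam_true)                 # P[Y_n = 1] for a one-bit sensor
y = [1 if random.random() < q else 0 for _ in range(N)]

S = sum(y)
lam_hat = -math.log(1 - S / N)              # closed-form ML estimate

def dlogL(lam):
    """Derivative of S*log(1 - exp(-lam)) - lam*(N - S) with respect to lam."""
    return S * math.exp(-lam) / (1 - math.exp(-lam)) - (N - S)

print(lam_hat, dlogL(lam_hat))
```

The derivative evaluated at `lam_hat` should be numerically zero, and `lam_hat` should land near `lam_true` for large `N`. Note that the formula breaks down if every pixel reports 1 (that is, $S = N$), since the logarithm is then undefined.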


For real images, you can extrapolate the idea from $y_n$ to $y_{i,j,t}$, which denotes the $(i, j)$th 
pixel located at time $t$. Defining $\boldsymbol{y}_t \in \mathbb{R}^{N \times N}$ as the $t$th frame of the observed data, we can 
use $T$ frames to recover one image $\widehat{\boldsymbol{\lambda}}_{\text{ML}} \in \mathbb{R}^{N \times N}$. It follows from the above derivation that 
the ML estimate is 

$$\widehat{\boldsymbol{\lambda}}_{\text{ML}} = -\log\left(1 - \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{y}_t\right). \tag{8.13}$$

Figure 8.11 shows a pair of input-output images of a $256 \times 256$ image. 


(a) Observed data (1-frame) (b) ML estimate (using 100 frames) 


Figure 8.11: ML estimation for a single-photon image sensor problem. The observed data consists of 
100 frames of binary measurements $\boldsymbol{y}_1, \ldots, \boldsymbol{y}_T$, where $T = 100$. The ML estimate is constructed by 
$\widehat{\boldsymbol{\lambda}} = -\log\left(1 - \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{y}_t\right)$. 


On a computer the ML estimation can be done in a few lines of MATLAB code. The 
code in Python requires more work, as it needs to read images using the openCV library. 
% MATLAB code to recover an image from binary measurements
lambda = im2double(imread('cameraman.tif'));
T = 100;                                    % 100 frames
x = poissrnd( repmat(lambda, [1,1,T]) );    % generate Poisson r.v.
y = (x>=1);                                 % binary truncation
lambdahat = -log(1-mean(y,3));              % ML estimation
figure(1); imshow(x(:,:,1));
figure(2); imshow(lambdahat);


# Python code to recover an image from binary measurements
import cv2
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
lambd = cv2.imread('./cameraman.tif')                     # read image
lambd = cv2.cvtColor(lambd, cv2.COLOR_BGR2GRAY)/255       # gray scale
T = 100                                                   # 100 frames
lambdT = np.repeat(lambd[:, :, np.newaxis], T, axis=2)    # repeat image
x = stats.poisson.rvs(lambdT)                             # Poisson statistics
y = (x>=1).astype(float)                                  # binary truncation
lambdhat = -np.log(1-np.mean(y,axis=2))                   # ML estimation
plt.imshow(lambdhat, cmap='gray')


8.1.5 More examples of ML estimation 


By now you should be familiar with the procedure for solving the ML estimation problem. 
We summarize the two steps as follows. 


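How to solve an ML estimation problem 
• Write down the likelihood $\mathcal{L}(\boldsymbol{\theta} \,|\, \boldsymbol{x})$. 
• Maximize the likelihood by solving $\widehat{\boldsymbol{\theta}}_{\text{ML}} = \underset{\boldsymbol{\theta}}{\text{argmax}} \; \log \mathcal{L}(\boldsymbol{\theta} \,|\, \boldsymbol{x})$. 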


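Practice Exercise 8.5 (Gaussian). Suppose that we are given a set of i.i.d. Gaussian 
random variables $X_1, \ldots, X_N$, where both the mean $\mu$ and the variance $\sigma^2$ are 
unknown. Let $\boldsymbol{\theta} = [\mu, \sigma^2]^T$ be the parameter. Find the ML estimate of $\boldsymbol{\theta}$. 


Solution. We first write down the likelihood. The likelihood of these i.i.d. Gaussian 
random variables is 

$$\mathcal{L}(\boldsymbol{\theta} \,|\, \boldsymbol{x}) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}.$$

To solve the ML estimation problem, we maximize the log-likelihood: 

$$\widehat{\boldsymbol{\theta}}_{\text{ML}} = \underset{\boldsymbol{\theta}}{\text{argmax}} \;\; \mathcal{L}(\boldsymbol{\theta} \,|\, \boldsymbol{x})$$
$$= \underset{\mu, \sigma^2}{\text{argmax}} \; \left\{ -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}.$$

Since we have two parameters, we need to take the derivatives for both. 

$$\frac{d}{d\mu} \left\{ -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\} = 0,$$
$$\frac{d}{d\sigma^2} \left\{ -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\} = 0.$$

(Note that the derivative of the second equation is taken w.r.t. $\sigma^2$ and not $\sigma$.) This 
pair of equations gives us 

$$\frac{1}{\sigma^2} \sum_{n=1}^{N} (x_n - \mu) = 0,$$
$$-\frac{N}{2} \cdot \frac{1}{2\pi\sigma^2} \cdot (2\pi) + \frac{1}{2\sigma^4} \sum_{n=1}^{N} (x_n - \mu)^2 = 0.$$

Rearranging the equations, we find that 

$$\widehat{\mu}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n \quad \text{and} \quad \widehat{\sigma}^2_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \widehat{\mu}_{\text{ML}})^2.$$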
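The two Gaussian estimates are just the sample mean and the variance normalized by $N$. The sketch below is an illustration only: the true parameters, sample size, and seed are assumed values. It computes both estimates by hand and compares the variance with `statistics.pvariance` from the Python standard library, which also divides by $N$.

```python
import random
import statistics

random.seed(2)                                  # assumed seed
mu_true, sigma_true, N = 1.5, 2.0, 5000         # illustrative values
x = [random.gauss(mu_true, sigma_true) for _ in range(N)]

mu_ml = sum(x) / N                              # ML estimate of the mean
var_ml = sum((xn - mu_ml) ** 2 for xn in x) / N # ML estimate of the variance (divide by N)

# statistics.pvariance is the "population" variance, i.e., it also divides by N.
print(mu_ml, var_ml, statistics.pvariance(x))
```

By contrast, `statistics.variance` divides by $N - 1$ and therefore does not return the ML estimate.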


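Practice Exercise 8.6. (Poisson) Given a set of i.i.d. Poisson random variables 
$X_1, \ldots, X_N$ with an unknown parameter $\lambda$, find the ML estimate of $\lambda$. 


Solution. For a Poisson random variable, the likelihood function is 

$$\mathcal{L}(\lambda \,|\, \boldsymbol{x}) = \prod_{n=1}^{N} \left\{ \frac{\lambda^{x_n}}{x_n!} e^{-\lambda} \right\}.$$

To solve the ML estimation problem, we note that 

$$\widehat{\lambda}_{\text{ML}} = \underset{\lambda}{\text{argmax}} \;\; \mathcal{L}(\lambda \,|\, \boldsymbol{x}) = \underset{\lambda}{\text{argmax}} \;\; \log \prod_{n=1}^{N} \left\{ \frac{\lambda^{x_n}}{x_n!} e^{-\lambda} \right\}$$
$$= \underset{\lambda}{\text{argmax}} \;\; \log \left\{ \frac{\lambda^{\sum_n x_n}}{\prod_n x_n!} e^{-N\lambda} \right\}.$$

Since $\prod_n x_n!$ is independent of $\lambda$, its presence or absence will not affect the optimization 
problem. Consequently we can drop the term. It follows that 

$$\widehat{\lambda}_{\text{ML}} = \underset{\lambda}{\text{argmax}} \;\; \log \left\{ \lambda^{\sum_n x_n} e^{-N\lambda} \right\}$$
$$= \underset{\lambda}{\text{argmax}} \;\; \left( \sum_n x_n \right) \log \lambda - N\lambda.$$

Taking the derivative and setting it to zero yields 

$$\frac{d}{d\lambda} \left\{ \left( \sum_n x_n \right) \log \lambda - N\lambda \right\} = \frac{\sum_n x_n}{\lambda} - N = 0.$$

Rearranging the terms yields 

$$\widehat{\lambda}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n.$$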
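As a quick check of the Poisson result, the sketch below maximizes $(\sum_n x_n) \log \lambda - N\lambda$ by brute force and compares the maximizer with the sample mean. The ten data values are hypothetical and chosen only so that the sample mean is easy to read off.

```python
import math

x = [2, 0, 3, 1, 4, 2, 1, 0, 5, 2]      # hypothetical Poisson counts
N = len(x)
lam_hat = sum(x) / N                    # closed-form ML estimate (sample mean)

def objective(lam):
    """Log-likelihood with the lambda-independent term sum(log(x_n!)) dropped."""
    return sum(x) * math.log(lam) - N * lam

grid = [k / 1000 for k in range(1, 10000)]   # candidate values of lambda
lam_grid = max(grid, key=objective)

print(lam_hat, lam_grid)
```

Dropping the factorial term does not move the maximizer, which is why the grid search over the reduced objective recovers the sample mean.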


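The idea of ML estimation can also be extended to vector observations. 


Example 8.7. (High-dimensional Gaussian) Suppose that we are given a set of i.i.d. 
$d$-dimensional Gaussian random vectors $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_N$ such that 

$$\boldsymbol{X}_n \sim \text{Gaussian}(\boldsymbol{\mu}, \boldsymbol{\Sigma}).$$

We assume that $\boldsymbol{\Sigma}$ is fixed and known, but $\boldsymbol{\mu}$ is unknown. Find the ML estimate of $\boldsymbol{\mu}$. 

Solution. The likelihood function is 

$$\mathcal{L}(\boldsymbol{\mu} \,|\, \{\boldsymbol{x}_n\}_{n=1}^{N}) = \prod_{n=1}^{N} f_{\boldsymbol{X}_n}(\boldsymbol{x}_n; \boldsymbol{\mu}) = \prod_{n=1}^{N} \left\{ \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\} \right\}.$$

Thus the log-likelihood function is 

$$\log \mathcal{L}(\boldsymbol{\mu} \,|\, \{\boldsymbol{x}_n\}_{n=1}^{N}) = -\frac{N}{2} \log |\boldsymbol{\Sigma}| - \frac{N}{2} \log (2\pi)^d - \sum_{n=1}^{N} \left\{ \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\}.$$

The ML estimate is found by maximizing this log-likelihood function: 

$$\widehat{\boldsymbol{\mu}}_{\text{ML}} = \underset{\boldsymbol{\mu}}{\text{argmax}} \;\; \log \mathcal{L}(\boldsymbol{\mu} \,|\, \{\boldsymbol{x}_n\}_{n=1}^{N}).$$

Taking the gradient of the function and setting it to zero, we have that 

$$\frac{d}{d\boldsymbol{\mu}} \left\{ -\frac{N}{2} \log |\boldsymbol{\Sigma}| - \frac{N}{2} \log (2\pi)^d - \sum_{n=1}^{N} \left\{ \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\} \right\} = 0.$$

The derivatives of the first two terms are zero because they do not depend on $\boldsymbol{\mu}$. Thus 
we have that: 

$$\sum_{n=1}^{N} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) = 0.$$

Rearranging the terms yields the ML estimate 

$$\widehat{\boldsymbol{\mu}}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n.$$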


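Example 8.8. (High-dimensional Gaussian) Assume the same problem setting as in 
Example 8.7, except that this time we assume that both the mean vector $\boldsymbol{\mu}$ and the 
covariance matrix $\boldsymbol{\Sigma}$ are unknown. Find the ML estimate for $\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\Sigma})$. 


Solution. The log-likelihood follows from Example 8.7: 

$$\log \mathcal{L}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \,|\, \{\boldsymbol{x}_n\}_{n=1}^{N}) = -\frac{N}{2} \log |\boldsymbol{\Sigma}| - \frac{N}{2} \log (2\pi)^d - \sum_{n=1}^{N} \left\{ \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\}.$$

Finding the ML estimate requires taking the derivative with respect to both $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$: 

$$\frac{d}{d\boldsymbol{\mu}} \left\{ -\frac{N}{2} \log |\boldsymbol{\Sigma}| - \frac{N}{2} \log (2\pi)^d - \sum_{n=1}^{N} \left\{ \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\} \right\} = 0,$$
$$\frac{d}{d\boldsymbol{\Sigma}} \left\{ -\frac{N}{2} \log |\boldsymbol{\Sigma}| - \frac{N}{2} \log (2\pi)^d - \sum_{n=1}^{N} \left\{ \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\} \right\} = 0.$$

After some tedious algebraic steps (see Duda et al., Pattern Classification, Problem 
3.14), we have that 

$$\widehat{\boldsymbol{\mu}}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n, \tag{8.17}$$

$$\widehat{\boldsymbol{\Sigma}}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} (\boldsymbol{x}_n - \widehat{\boldsymbol{\mu}}_{\text{ML}})(\boldsymbol{x}_n - \widehat{\boldsymbol{\mu}}_{\text{ML}})^T. \tag{8.18}$$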
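For vector data, the ML estimates are the sample mean and the sample covariance normalized by $N$. The following sketch uses NumPy with an assumed two-dimensional mean and covariance (both made up for illustration) and checks the hand-computed covariance against `np.cov` with `bias=True`, which also normalizes by $N$.

```python
import numpy as np

rng = np.random.default_rng(3)                      # assumed seed
mu_true = np.array([1.0, -2.0])                     # illustrative mean vector
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])                 # illustrative covariance matrix
N = 20000
x = rng.multivariate_normal(mu_true, Sigma_true, size=N)   # shape (N, d)

mu_ml = x.mean(axis=0)                              # ML estimate of the mean vector
centered = x - mu_ml
Sigma_ml = centered.T @ centered / N                # ML estimate of the covariance (divide by N)

# np.cov normalizes by N when bias=True; rowvar=False treats rows as samples.
Sigma_np = np.cov(x, rowvar=False, bias=True)
print(mu_ml)
print(Sigma_ml)
```

The default `np.cov` (without `bias=True`) divides by $N - 1$, so it returns a slightly different matrix from the ML estimate.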
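8.1.6 Regression versus ML estimation 


ML estimation is closely related to regression. To understand the connection, we consider a 
linear model that we studied in Chapter 7. This model describes the relationship between 
the inputs $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N$ and the observed outputs $y_1, \ldots, y_N$, via the equation 

$$y_n = \sum_{p=0}^{d-1} \theta_p \phi_p(\boldsymbol{x}_n) + e_n, \qquad n = 1, \ldots, N. \tag{8.19}$$

In this expression, $\phi_p(\cdot)$ is a transformation that extracts the “features” of the input vector 
$\boldsymbol{x}$ to produce a scalar. The coefficient $\theta_p$ defines the relative weight of the feature $\phi_p(\boldsymbol{x}_n)$ in 
constructing the observed variable $y_n$. The error $e_n$ defines the modeling error between the 
observation $y_n$ and the prediction $\sum_{p=0}^{d-1} \theta_p \phi_p(\boldsymbol{x}_n)$. We call this equation a linear model. 
Expressed in matrix form, the linear model is 

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{=\boldsymbol{y}} = \underbrace{\begin{bmatrix} \phi_0(\boldsymbol{x}_1) & \phi_1(\boldsymbol{x}_1) & \cdots & \phi_{d-1}(\boldsymbol{x}_1) \\ \phi_0(\boldsymbol{x}_2) & \phi_1(\boldsymbol{x}_2) & \cdots & \phi_{d-1}(\boldsymbol{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\boldsymbol{x}_N) & \phi_1(\boldsymbol{x}_N) & \cdots & \phi_{d-1}(\boldsymbol{x}_N) \end{bmatrix}}_{=\boldsymbol{X}} \underbrace{\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{d-1} \end{bmatrix}}_{=\boldsymbol{\theta}} + \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}}_{=\boldsymbol{e}}$$

or more compactly as $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\theta} + \boldsymbol{e}$. Rearranging the terms, it is easy to show that 

$$\sum_{n=1}^{N} e_n^2 = \sum_{n=1}^{N} \left( y_n - \sum_{p=0}^{d-1} \theta_p \phi_p(\boldsymbol{x}_n) \right)^2$$
$$= \sum_{n=1}^{N} \left( y_n - [\boldsymbol{X}\boldsymbol{\theta}]_n \right)^2 = \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2.$$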


Now we make an assumption: that each noise $e_n$ is an i.i.d. copy of a Gaussian random 
variable with zero mean and variance $\sigma^2$. In other words, the error vector $\boldsymbol{e}$ is distributed 
according to $\boldsymbol{e} \sim \text{Gaussian}(0, \sigma^2 \boldsymbol{I})$. This assumption is not always true because there are 
many situations in which the error is not Gaussian. However, this assumption is necessary 
for us to make the connection between ML estimation and regression. 

With this assumption, we ask, given the observations $y_1, \ldots, y_N$, what would be the 
ML estimate of the unknown parameter $\boldsymbol{\theta}$? We answer this question in two steps. 


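Example 8.9. Find the likelihood function of $\boldsymbol{\theta}$, given $\boldsymbol{y} = [y_1, \ldots, y_N]^T$. 
Solution. The PDF of $\boldsymbol{y}$ is given by a Gaussian: 

$$f_{\boldsymbol{Y}}(\boldsymbol{y}; \boldsymbol{\theta}) = \prod_{n=1}^{N} \left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_n - [\boldsymbol{X}\boldsymbol{\theta}]_n)^2}{2\sigma^2} \right\} \right\} = \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left\{ -\frac{1}{2\sigma^2} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 \right\}.$$

Therefore, the log-likelihood function is 

$$\log \mathcal{L}(\boldsymbol{\theta} \,|\, \boldsymbol{y}) = \log \left\{ \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left\{ -\frac{1}{2\sigma^2} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 \right\} \right\}$$
$$= -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2.$$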


The next step is to solve the ML estimation by maximizing the log-likelihood. 


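Example 8.10. Solve the ML estimation problem stated in Example 8.9. Assume that 
$\boldsymbol{X}^T \boldsymbol{X}$ is invertible. 


Solution. 

$$\widehat{\boldsymbol{\theta}}_{\text{ML}} = \underset{\boldsymbol{\theta}}{\text{argmax}} \;\; \log \mathcal{L}(\boldsymbol{\theta} \,|\, \boldsymbol{y})$$
$$= \underset{\boldsymbol{\theta}}{\text{argmax}} \; \left\{ -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 \right\}.$$

Taking the derivative w.r.t. $\boldsymbol{\theta}$ yields 

$$\frac{d}{d\boldsymbol{\theta}} \left\{ -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 \right\} = 0.$$

Since $\frac{d}{d\boldsymbol{\theta}} \boldsymbol{\theta}^T \boldsymbol{A} \boldsymbol{\theta} = (\boldsymbol{A} + \boldsymbol{A}^T)\boldsymbol{\theta}$, it follows from the chain rule that 

$$\frac{d}{d\boldsymbol{\theta}} \left\{ -\frac{1}{2\sigma^2} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 \right\} = -\frac{1}{\sigma^2} \boldsymbol{X}^T (\boldsymbol{X}\boldsymbol{\theta} - \boldsymbol{y}).$$

Substituting this result into the equation, 

$$-\frac{1}{\sigma^2} \boldsymbol{X}^T (\boldsymbol{X}\boldsymbol{\theta} - \boldsymbol{y}) = 0.$$

Rearranging terms we obtain $\boldsymbol{X}^T \boldsymbol{X} \boldsymbol{\theta} = \boldsymbol{X}^T \boldsymbol{y}$, of which the solution is 

$$\widehat{\boldsymbol{\theta}}_{\text{ML}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}. \tag{8.21}$$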


Since the ML estimate in Equation (8.21) is the same as the regression solution (see 
Chapter 7), we conclude that the regression problem of a linear model is equivalent to solving 
an ML estimation problem. 
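The equivalence can be illustrated numerically. The sketch below is a toy setup, not taken from the text: it assumes polynomial features $\phi_p(x) = x^p$ for $p = 0, 1, 2$, a made-up parameter vector, and Gaussian noise with known $\sigma$. It solves the normal equations, compares the result with NumPy's least-squares routine, and evaluates the Gaussian log-likelihood at a few points.

```python
import numpy as np

rng = np.random.default_rng(4)                   # assumed seed
N, sigma = 200, 0.5                              # illustrative sample size and noise level
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x, x**2])       # assumed features phi_p(x) = x^p, p = 0, 1, 2
theta_true = np.array([1.0, -2.0, 0.5])          # hypothetical true parameter
y = X @ theta_true + sigma * rng.standard_normal(N)

# ML estimate under Gaussian noise: solve the normal equations X^T X theta = X^T y.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Plain least-squares regression, with no statistical model attached.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

def log_likelihood(theta):
    """Gaussian log-likelihood: -(N/2) log(2 pi sigma^2) - ||y - X theta||^2 / (2 sigma^2)."""
    r = y - X @ theta
    return -0.5 * N * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)

print(theta_ml, theta_ls)
print(log_likelihood(theta_ml), log_likelihood(theta_true))
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse $(\boldsymbol{X}^T \boldsymbol{X})^{-1}$, which is usually the numerically safer choice.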

The main difference between a linear regression problem and an ML estimation problem 
is the underlying statistical model, as illustrated in Figure 8.12. In linear regression, you 
do not care about the statistics of the noise term $e_n$. We choose $(\cdot)^2$ as the error because it 
is differentiable and convenient. In ML estimation, we choose $(\cdot)^2$ as the error because the 
noise is Gaussian. If the noise is not Gaussian, e.g., the noise follows a Laplace distribution, 
we need to choose $|\cdot|$ as the error. Therefore, you can always get a result by solving the 
linear regression. However, this result will only become meaningful if you provide additional 
information about the problem. For example, if you know that the noise is Gaussian, then 
the regression solution is also the ML solution. This is a statistical guarantee. 

Regression 
  Optimization: $\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \; \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2$ 
  Solution: $\widehat{\boldsymbol{\theta}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$ 
  Assumption: None 

Maximum-Likelihood 
  Optimization: $\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\text{argmax}} \; \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left\{ -\frac{1}{2\sigma^2} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 \right\}$ 
  Solution: $\widehat{\boldsymbol{\theta}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$ 
  Assumption: $\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta} \sim \text{Gaussian}(0, \sigma^2 \boldsymbol{I})$ 

Figure 8.12: ML estimation is equivalent to a linear regression when the underlying statistical model 
for ML estimation is a Gaussian. Specifically, if the error term $\boldsymbol{e} = \boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}$ is an independent Gaussian 
vector with zero mean and covariance matrix $\sigma^2 \boldsymbol{I}$, then the resulting ML estimation is the same as linear 
regression. If the underlying statistical model is not Gaussian, then solving the regression is equivalent 
to applying a Gaussian ML estimation to a non-Gaussian problem. This will still give us a result, but 
that result will not maximize the likelihood, and thus it will not have any statistical guarantee. 

In practice, of course, we do not know whether the noise is Gaussian or not. At this 
point we have two courses of action: (i) Use your prior knowledge/domain expertise to 
determine whether a Gaussian assumption makes sense, or (ii) select an alternative model 
and see if the alternative model fits the data better. In practice, we should also question 
whether maximizing the likelihood is what we want. We may have some prior knowledge and 
therefore prefer certain values of the parameter $\boldsymbol{\theta}$, e.g., we may want a sparse solution so that 
$\boldsymbol{\theta}$ contains only a few non-zeros. In that case, maximizing the likelihood without any constraint 
may not be the solution we want. 


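ML estimation versus regression 


ML estimation requires a statistical assumption, whereas regression does not. 

Suppose that you use a linear model $y_n = \sum_{p=0}^{d-1} \theta_p \phi_p(\boldsymbol{x}_n) + e_n$, where $e_n \sim \text{Gaussian}(0, \sigma^2)$, 
for $n = 1, \ldots, N$. 

Then the likelihood function in the ML estimation is 

$$\mathcal{L}(\boldsymbol{\theta} \,|\, \boldsymbol{y}) = \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left\{ -\frac{1}{2\sigma^2} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 \right\}.$$

The ML estimate $\widehat{\boldsymbol{\theta}}_{\text{ML}}$ is $\widehat{\boldsymbol{\theta}}_{\text{ML}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$, which is exactly the same as 
the regression solution. If the above statistical assumptions do not hold, then the 
regression solution will not maximize the likelihood. 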
8.2 Properties of ML Estimates 


ML estimation is a very special type of estimation. Not all estimations are ML. If an estimate 
is ML, are there any theoretical properties we can analyze? For example, will ML estimates 
guarantee the recovery of the true parameter? If so, when will this happen? In this section 
we investigate these theoretical questions so that you will acquire a better understanding of 
the statistical nature of ML estimates.² 


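8.2.1 Estimators 
We know that an ML estimate is defined as 

$$\widehat{\theta}_{\text{ML}}(\boldsymbol{x}) = \underset{\theta}{\text{argmax}} \;\; \mathcal{L}(\theta \,|\, \boldsymbol{x}). \tag{8.22}$$

We write $\widehat{\theta}_{\text{ML}}(\boldsymbol{x})$ to emphasize that $\widehat{\theta}_{\text{ML}}$ is a function of $\boldsymbol{x}$. The dependency of $\widehat{\theta}_{\text{ML}}(\boldsymbol{x})$ on 
$\boldsymbol{x}$ should not be a surprise. For example, if the ML estimate is the sample average, we have 
that 

$$\widehat{\theta}_{\text{ML}}(x_1, \ldots, x_N) = \frac{1}{N} \sum_{n=1}^{N} x_n,$$

where $\boldsymbol{x} = [x_1, \ldots, x_N]^T$. 
However, in this setting we should always remember that $x_1, \ldots, x_N$ are realizations 
of the i.i.d. random variables $X_1, \ldots, X_N$. Therefore, if we want to analyze the randomness 
of the variables, it is more reasonable to write $\widehat{\theta}_{\text{ML}}$ as a random variable $\widehat{\Theta}_{\text{ML}}$. For example, 
in the case of sample average, we have that 

$$\widehat{\Theta}_{\text{ML}}(X_1, \ldots, X_N) = \frac{1}{N} \sum_{n=1}^{N} X_n. \tag{8.23}$$

We call $\widehat{\Theta}_{\text{ML}}$ the ML estimator of the true parameter $\theta$. 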


Estimate versus estimator 


• An estimate is a number, e.g., $\widehat{\theta}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n$. It is the random realization of 
a random variable. 

• An estimator is a random variable, e.g., $\widehat{\Theta}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} X_n$. It takes a set of 
random variables and generates another random variable. 


²For notational simplicity, in this section we will focus on a scalar parameter $\theta$ instead of a vector 
parameter $\boldsymbol{\theta}$. 
The ML estimators are one type of estimator, namely those that maximize the likelihood 
functions. If we do not want to maximize the likelihood we can still define an estimator. 
An estimator is any function that takes the data points $X_1, \ldots, X_N$ and maps them to a 
number (or a vector of numbers). That is, an estimator is 

$$\widehat{\Theta}(X_1, \ldots, X_N).$$

We call $\widehat{\Theta}$ the estimator of the true parameter $\theta$. 


Example 8.11. Let $X_1, \ldots, X_N$ be Gaussian i.i.d. random variables with unknown 
mean $\theta$ and known variance $\sigma^2$. Construct two possible estimators. 


Solution. We define two estimators: 

$$\widehat{\Theta}_1(X_1, \ldots, X_N) = \frac{1}{N} \sum_{n=1}^{N} X_n,$$
$$\widehat{\Theta}_2(X_1, \ldots, X_N) = X_1.$$

In the first case, the estimator takes all the samples and constructs the sample average. 
The second estimator takes all the samples and returns only the first element. Both are 
legitimate estimators. However, $\widehat{\Theta}_1$ is the ML estimator, whereas $\widehat{\Theta}_2$ is not. 


8.2.2 Unbiased estimators 


While you can define estimators in any way you like, certain estimators are good and others 
are bad. By “good” we mean that the estimator can provide you with the information about 
the true parameter $\theta$; otherwise, why would you even construct such an estimator? However, 
the difficulty here is that $\widehat{\Theta}$ is a random variable because it is constructed from $X_1, \ldots, X_N$. 
Therefore, we need to define different metrics to quantify the usefulness of the estimators. 


Definition 8.5. An estimator $\widehat{\Theta}$ is unbiased if 

$$\mathbb{E}[\widehat{\Theta}] = \theta.$$

Unbiasedness means that the average of the random variable $\widehat{\Theta}$ matches the true 
parameter $\theta$. In other words, while we allow $\widehat{\Theta}$ to fluctuate, we expect the average to match 
the true $\theta$. If this is not the case, using more measurements will not help us get closer to $\theta$. 


Example 8.12. Let $X_1, \ldots, X_N$ be i.i.d. Gaussian random variables with an unknown 
mean $\theta$. It has been shown that the ML estimator is 

$$\widehat{\Theta}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} X_n. \tag{8.25}$$

Is the ML estimator $\widehat{\Theta}_{\text{ML}}$ unbiased? 

Solution: To check the unbiasedness, we look at the expectation: 

$$\mathbb{E}[\widehat{\Theta}_{\text{ML}}] = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[X_n] = \frac{1}{N} \sum_{n=1}^{N} \theta = \theta.$$

Thus, $\widehat{\Theta}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} X_n$ is an unbiased estimator of $\theta$. 


Example 8.13. Same as the example before, but this time we consider an estimator 

$$\widehat{\Theta} = X_1 + X_2 + 5. \tag{8.26}$$

Is this estimator unbiased? 


Solution: In this case, 

$$\mathbb{E}[\widehat{\Theta}] = \mathbb{E}[X_1 + X_2 + 5] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + 5 = 2\theta + 5 \ne \theta.$$

Therefore, the estimator is biased. 


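Example 8.14. Let $X_1, \ldots, X_N$ be i.i.d. Gaussian random variables with unknown 
mean $\mu$ and unknown variance $\sigma^2$. We have shown that the ML estimators are 

$$\widehat{\mu}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} X_n \quad \text{and} \quad \widehat{\sigma}^2_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} (X_n - \widehat{\mu}_{\text{ML}})^2.$$

It is easy to show that $\mathbb{E}[\widehat{\mu}_{\text{ML}}] = \mu$. How about $\widehat{\sigma}^2_{\text{ML}}$? Is it an unbiased estimator? 


Solution: For simplicity we assume $\mu = 0$ so that $\mathbb{E}[X_n^2] = \mathbb{E}[(X_n - 0)^2] = \sigma^2$. 
Note that 

$$\mathbb{E}[\widehat{\sigma}^2_{\text{ML}}] = \frac{1}{N} \sum_{n=1}^{N} \left\{ \mathbb{E}[X_n^2] - 2\mathbb{E}[\widehat{\mu}_{\text{ML}} X_n] + \mathbb{E}[\widehat{\mu}^2_{\text{ML}}] \right\}$$
$$= \frac{1}{N} \sum_{n=1}^{N} \left\{ \sigma^2 - 2\mathbb{E}\left[ \left( \frac{1}{N} \sum_{j=1}^{N} X_j \right) X_n \right] + \mathbb{E}\left[ \left( \frac{1}{N} \sum_{j=1}^{N} X_j \right)^2 \right] \right\}.$$

By independence, we observe that $\mathbb{E}[X_j X_n] = \mathbb{E}[X_j]\mathbb{E}[X_n] = 0$, for any $j \ne n$. Therefore, 

$$\mathbb{E}\left[ \left( \frac{1}{N} \sum_{j=1}^{N} X_j \right) X_n \right] = \frac{1}{N} \mathbb{E}[X_n^2] = \frac{\sigma^2}{N}.$$

Similarly, we have that 

$$\mathbb{E}\left[ \left( \frac{1}{N} \sum_{n=1}^{N} X_n \right)^2 \right] = \frac{1}{N^2} \left\{ \sum_{n=1}^{N} \mathbb{E}[X_n^2] + \sum_{j \ne n} \mathbb{E}[X_j X_n] \right\} = \frac{\sigma^2}{N}.$$

Substituting these two results, 

$$\mathbb{E}[\widehat{\sigma}^2_{\text{ML}}] = \frac{1}{N} \sum_{n=1}^{N} \left\{ \sigma^2 - \frac{2\sigma^2}{N} + \frac{\sigma^2}{N} \right\} = \frac{N-1}{N} \sigma^2,$$

which is not equal to $\sigma^2$. Therefore, $\widehat{\sigma}^2_{\text{ML}}$ is a biased estimator of $\sigma^2$. 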


In the previous example, it is possible to construct an unbiased estimator for the 
variance. To do so, we can use 

$$\widehat{\sigma}^2_{\text{unbias}} = \frac{1}{N-1} \sum_{n=1}^{N} (X_n - \widehat{\mu}_{\text{ML}})^2, \tag{8.27}$$

so that $\mathbb{E}[\widehat{\sigma}^2_{\text{unbias}}] = \sigma^2$. However, note that $\widehat{\sigma}^2_{\text{unbias}}$ does not maximize the likelihood, so while 
you can get unbiasedness, you cannot maximize the likelihood. If you want to maximize the 
likelihood, you cannot get unbiasedness. 
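The bias of the $1/N$ estimator is easy to observe by simulation. The sketch below is a Monte Carlo illustration with assumed values ($N = 5$ samples per trial, $\sigma^2 = 4$, a fixed seed): it averages both variance estimators over many trials, so the first average should sit near $\frac{N-1}{N}\sigma^2$ and the second near $\sigma^2$.

```python
import random

random.seed(5)                                  # assumed seed
N, sigma2, trials = 5, 4.0, 40000               # illustrative values

sum_ml = sum_unbias = 0.0
for _ in range(trials):
    x = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(N)]
    m = sum(x) / N                              # sample mean (ML estimate of the mean)
    ss = sum((xn - m) ** 2 for xn in x)         # sum of squared deviations
    sum_ml += ss / N                            # ML variance estimator
    sum_unbias += ss / (N - 1)                  # unbiased variance estimator

avg_ml = sum_ml / trials                        # expected to be near (N-1)/N * sigma2
avg_unbias = sum_unbias / trials                # expected to be near sigma2
print(avg_ml, avg_unbias, (N - 1) / N * sigma2)
```

A small $N$ is used on purpose: the gap between the two estimators is a factor of $(N-1)/N$, which becomes hard to see once $N$ is large.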


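What is an unbiased estimator? 


• An estimator $\widehat{\Theta}$ is unbiased if $\mathbb{E}[\widehat{\Theta}] = \theta$. 

• Unbiased means that the statistical average of $\widehat{\Theta}$ is the true parameter $\theta$. 

• If $X_n \sim \text{Gaussian}(\theta, \sigma^2)$, then $\widehat{\Theta} = (1/N) \sum_{n=1}^{N} X_n$ is unbiased, but $\widehat{\Theta} = X_1 + X_2 + 5$ is 
biased (see Example 8.13). 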


8.2.3 Consistent estimators 


By definition, an estimator $\widehat{\Theta}(X_1, \ldots, X_N)$ is a function of $N$ random variables $X_1, \ldots, X_N$. 
Therefore, $\widehat{\Theta}(X_1, \ldots, X_N)$ changes as $N$ grows. In this subsection we analyze how $\widehat{\Theta}$ behaves 
when $N$ changes. For notational simplicity we use the following notation: 

$$\widehat{\Theta}_N = \widehat{\Theta}(X_1, \ldots, X_N). \tag{8.28}$$

Thus, as $N$ increases, we use more random variables in defining $\widehat{\Theta}(X_1, \ldots, X_N)$. 

Definition 8.6. An estimator $\widehat{\Theta}_N$ is consistent if $\widehat{\Theta}_N \overset{p}{\longrightarrow} \theta$, i.e., 

$$\lim_{N \to \infty} \mathbb{P}\left[ |\widehat{\Theta}_N - \theta| \ge \epsilon \right] = 0.$$

The definition here follows from our discussions of the law of large numbers in Chapter 6. 
The specific type of convergence is known as the convergence in probability. It says that 
as $N$ grows, the estimator $\widehat{\Theta}_N$ will be close enough to $\theta$ so that the probability of getting a 
large deviation will diminish, as illustrated in Figure 8.13. 


[Figure 8.13 plots: the PDF of $\widehat{\Theta}_N$ over the horizontal range $-5$ to $5$, with panels including (a) $N = 1$, (c) $N = 4$, and (d) $N = 8$.] 

Figure 8.13: The four subfigures here illustrate the probability of error $\mathbb{P}[|\widehat{\Theta}_N - \theta| \ge \epsilon]$, which is 
represented by the areas shaded in blue. We assume that the estimator $\widehat{\Theta}_N$ is a Gaussian random 
variable following a distribution $\text{Gaussian}(0, \frac{\sigma^2}{N})$, where we set $\sigma = 1$. The threshold we use in this 
figure is $\epsilon = 1$. As $N$ grows, we see that the probability of error diminishes. If the probability of error 
goes to zero, we say that the estimator is consistent. 
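The examples in Figure 8.13 are typical situations for an estimator based on the 
sample average. For example, if we assume that $X_1, \ldots, X_N$ are i.i.d. Gaussian copies of 
$\text{Gaussian}(\theta, \sigma^2)$, then the estimator 

$$\widehat{\Theta}(X_1, \ldots, X_N) = \frac{1}{N} \sum_{n=1}^{N} X_n$$

will follow a Gaussian distribution $\text{Gaussian}(\theta, \frac{\sigma^2}{N})$. (Please refer to Chapter 6 for the derivation.) 
Then, as $N$ grows, the PDF of $\widehat{\Theta}_N$ becomes narrower and narrower. For a fixed $\epsilon$, it 
follows that the probability of error will diminish to zero. In fact, we can prove that, for this 
example, 

$$\mathbb{P}\left[ |\widehat{\Theta}_N - \theta| \ge \epsilon \right] = \mathbb{P}\left[ \widehat{\Theta}_N - \theta \ge \epsilon \right] + \mathbb{P}\left[ \widehat{\Theta}_N - \theta \le -\epsilon \right]$$
$$= \int_{\theta + \epsilon}^{\infty} \text{Gaussian}\left( z \,\Big|\, \theta, \frac{\sigma^2}{N} \right) dz + \int_{-\infty}^{\theta - \epsilon} \text{Gaussian}\left( z \,\Big|\, \theta, \frac{\sigma^2}{N} \right) dz$$
$$= \int_{\theta + \epsilon}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2/N}} e^{-\frac{(z - \theta)^2}{2\sigma^2/N}} dz + \int_{-\infty}^{\theta - \epsilon} \frac{1}{\sqrt{2\pi\sigma^2/N}} e^{-\frac{(z - \theta)^2}{2\sigma^2/N}} dz$$
$$= 1 - \Phi\left( \frac{\epsilon}{\sigma/\sqrt{N}} \right) + \Phi\left( \frac{-\epsilon}{\sigma/\sqrt{N}} \right) = 2\Phi\left( \frac{-\epsilon}{\sigma/\sqrt{N}} \right).$$

Therefore, as $N \to \infty$, it holds that $\frac{-\epsilon}{\sigma/\sqrt{N}} \to -\infty$. Hence, 

$$\lim_{N \to \infty} \mathbb{P}\left[ |\widehat{\Theta}_N - \theta| \ge \epsilon \right] = \lim_{N \to \infty} 2\Phi\left( \frac{-\epsilon}{\sigma/\sqrt{N}} \right) = 0.$$

This explains why in Figure 8.13 the probability of error diminishes to zero as $N$ grows. 
Therefore, we say that $\widehat{\Theta}_N$ is consistent. 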
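The expression $2\Phi(-\epsilon\sqrt{N}/\sigma)$ can be evaluated directly. The sketch below uses the identity $2\Phi(-a) = \text{erfc}(a/\sqrt{2})$ from the Python standard library, with $\epsilon = 1$ and $\sigma = 1$ as in Figure 8.13; the function name `prob_error` and the particular values of $N$ are illustrative.

```python
import math

def prob_error(N, eps=1.0, sigma=1.0):
    """P[|Theta_N - theta| >= eps] = 2*Phi(-eps*sqrt(N)/sigma) for the Gaussian sample average."""
    a = eps * math.sqrt(N) / sigma
    return math.erfc(a / math.sqrt(2))      # identity: 2*Phi(-a) = erfc(a / sqrt(2))

probs = {N: prob_error(N) for N in (1, 2, 4, 8)}
for N, pr in probs.items():
    print(N, round(pr, 4))
```

The probabilities shrink quickly with $N$, which is the numerical counterpart of the shaded areas in the figure becoming smaller.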
In general, there are two ways to check whether an estimator is consistent: 


• Prove convergence in probability. This is based on the definition of a consistent 
estimator. If we can prove that 

$$\lim_{N \to \infty} \mathbb{P}\left[ |\widehat{\Theta}_N - \theta| \ge \epsilon \right] = 0, \tag{8.30}$$

then we say that the estimator is consistent. 


• Prove convergence in mean squared error: 

$$\lim_{N \to \infty} \mathbb{E}\left[ (\widehat{\Theta}_N - \theta)^2 \right] = 0. \tag{8.31}$$

To see why convergence in the mean squared error is sufficient to guarantee consistency, 
we recall Chebyshev’s inequality in Chapter 6, which says that 

$$\mathbb{P}\left[ |\widehat{\Theta}_N - \theta| \ge \epsilon \right] \le \frac{\mathbb{E}\left[ (\widehat{\Theta}_N - \theta)^2 \right]}{\epsilon^2}.$$

Thus, if $\lim_{N \to \infty} \mathbb{E}[(\widehat{\Theta}_N - \theta)^2] = 0$, convergence in probability will also hold. However, 
since mean square convergence is stronger than convergence in probability, being 
unable to show mean square convergence does not imply that an estimator is inconsistent. 


Be careful not to confuse a consistent estimator and an unbiased estimator. The two 
are different concepts; one does not imply the other. 

• Consistent = If you have enough samples, then the estimator $\widehat{\Theta}$ will converge to 
the true parameter. 


• Unbiasedness does not imply consistency. For example (Gaussian), if 

$$\widehat{\Theta} = X_1,$$

then $\mathbb{E}[X_1] = \mu$. But $\mathbb{P}[|\widehat{\Theta} - \mu| \ge \epsilon]$ does not converge to 0 as $N$ grows. So this 
estimator is inconsistent. (See Example 8.16 below.) 


• Consistency does not imply unbiasedness. For example (Gaussian), 

$$\widehat{\Theta} = \frac{1}{N} \sum_{n=1}^{N} (X_n - \widehat{\mu}_{\text{ML}})^2$$

is a biased estimator of the variance, but it is consistent. (See Example 8.17 below.) 


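Example 8.15. Let $X_1, \ldots, X_N$ be i.i.d. Gaussian random variables with an unknown 
mean $\mu$ and known variance $\sigma^2$. We know that the ML estimator for the mean is 
$\widehat{\mu}_{\text{ML}} = (1/N) \sum_{n=1}^{N} X_n$. Is $\widehat{\mu}_{\text{ML}}$ consistent? 


Solution. We have shown that the ML estimator is 

$$\widehat{\mu}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} X_n.$$

Since $\mathbb{E}[\widehat{\mu}_{\text{ML}}] = \mu$, and $\mathbb{E}[(\widehat{\mu}_{\text{ML}} - \mu)^2] = \text{Var}[\widehat{\mu}_{\text{ML}}] = \frac{\sigma^2}{N}$, it follows that 

$$\mathbb{P}\left[ |\widehat{\mu}_{\text{ML}} - \mu| \ge \epsilon \right] \le \frac{\mathbb{E}[(\widehat{\mu}_{\text{ML}} - \mu)^2]}{\epsilon^2} = \frac{\sigma^2}{N\epsilon^2}.$$

Thus, when $N$ goes to infinity, the probability converges to zero, and hence the estimator 
is consistent. 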


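Example 8.16. Let $X_1, \ldots, X_N$ be i.i.d. Gaussian random variables with an unknown 
mean $\mu$ and known variance $\sigma^2$. Define an estimator $\widehat{\mu} = X_1$. Show that the estimator 
is unbiased but inconsistent. 


Solution. We know that $\mathbb{E}[\widehat{\mu}] = \mathbb{E}[X_1] = \mu$. So $\widehat{\mu}$ is an unbiased estimator. However, 
we can show that 

$$\mathbb{E}[(\widehat{\mu} - \mu)^2] = \mathbb{E}[(X_1 - \mu)^2] = \sigma^2.$$

Since this variance $\mathbb{E}[(\widehat{\mu} - \mu)^2]$ does not shrink as $N$ increases, it follows that no matter 
how many samples we use we cannot make $\mathbb{E}[(\widehat{\mu} - \mu)^2]$ go to zero. To be more precise, 

$$\mathbb{P}\left[ |\widehat{\mu} - \mu| \ge \epsilon \right] = \mathbb{P}\left[ |X_1 - \mu| \ge \epsilon \right]$$
$$= \mathbb{P}\left[ X_1 \le \mu - \epsilon \right] + \mathbb{P}\left[ X_1 \ge \mu + \epsilon \right]$$
$$= \int_{-\infty}^{\mu - \epsilon} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} dx + \int_{\mu + \epsilon}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} dx$$
$$= 2\Phi\left( -\frac{\epsilon}{\sigma} \right),$$

which does not converge to zero as $N \to \infty$. So the estimator is inconsistent. 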
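The contrast between the two estimators can also be seen empirically. In the sketch below, all numerical values ($\mu = 0$, $\sigma = 1$, $\epsilon = 0.5$, the number of trials, the seed) are assumptions made for illustration, and `deviation_rates` is a hypothetical helper. For each $N$ it estimates the probability of a deviation of at least $\epsilon$ for the estimator $X_1$ and for the sample average.

```python
import random

random.seed(6)                                   # assumed seed
mu, sigma, eps, trials = 0.0, 1.0, 0.5, 4000     # illustrative values

def deviation_rates(N):
    """Empirical P[|estimator - mu| >= eps] for X_1 and for the sample average of N samples."""
    first = mean = 0
    for _ in range(trials):
        x = [random.gauss(mu, sigma) for _ in range(N)]
        first += abs(x[0] - mu) >= eps           # estimator that keeps only the first sample
        mean += abs(sum(x) / N - mu) >= eps      # sample-average estimator
    return first / trials, mean / trials

rates = {N: deviation_rates(N) for N in (1, 10, 100)}
for N, (r_first, r_mean) in rates.items():
    print(N, r_first, r_mean)
```

The first column of rates should hover around $2\Phi(-\epsilon/\sigma)$ regardless of $N$, while the second column should fall toward zero, matching the two examples above.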


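Example 8.17. Let $X_1, \ldots, X_N$ be i.i.d. Gaussian random variables with an unknown 
mean $\mu$ and an unknown variance $\sigma^2$. Is the ML estimate of the variance, i.e., $\widehat{\sigma}^2_{\text{ML}}$, 
consistent? 


Solution. We know that the ML estimator for the mean is 

$$\widehat{\mu}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} X_n,$$

and we have shown that it is an unbiased and consistent estimator of the mean. For 
the variance, 

$$\widehat{\sigma}^2_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} (X_n - \widehat{\mu}_{\text{ML}})^2 = \frac{1}{N} \sum_{n=1}^{N} X_n^2 - \widehat{\mu}^2_{\text{ML}}.$$

Note that $\frac{1}{N} \sum_{n=1}^{N} X_n^2$ is the sample average of the second moment, and so by the 
weak law of large numbers it should converge in probability to $\mathbb{E}[X_n^2]$. Similarly, $\widehat{\mu}_{\text{ML}}$ 
will converge in probability to $\mu$. Therefore, we have 

$$\widehat{\sigma}^2_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} X_n^2 - \widehat{\mu}^2_{\text{ML}} \overset{p}{\longrightarrow} (\sigma^2 + \mu^2) - \mu^2 = \sigma^2.$$

Thus, we have shown that the ML estimator of the variance is biased but consistent. 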
The following discussions about the consistency of ML estimators can be skipped. 


As we have said, there are many estimators. Some estimators are consistent and some 
are not. The ML estimators are special. It turns out that under certain regularity conditions 
the ML estimators of i.i.d. observations are consistent. 

Without proving this result formally, we highlight a few steps to illustrate the idea. 
Suppose that we have a set of i.i.d. data points $x_1, \ldots, x_N$ drawn from some distribution 
$f(x; \theta_{\text{true}})$. To formulate the ML estimation, we consider the log-likelihood function (divided 
by $N$): 

$$\frac{1}{N} \log \mathcal{L}(\theta \,|\, \boldsymbol{x}) = \frac{1}{N} \sum_{n=1}^{N} \log f(x_n; \theta). \tag{8.32}$$

Here, the variable $\theta$ is unknown. We need to find it by maximizing the log-likelihood. 
By the weak law of large numbers, we can show that the log-likelihood based on the 
$N$ samples will converge in probability to 

$$\underbrace{\frac{1}{N} \sum_{n=1}^{N} \log f(x_n; \theta)}_{g_N(\theta)} \overset{p}{\longrightarrow} \mathbb{E}[\log f(x; \theta)]. \tag{8.33}$$

The expectation can be evaluated by integrating over the true distribution: 

$$\mathbb{E}[\log f(x; \theta)] = \underbrace{\int \log f(x; \theta) \cdot f(x; \theta_{\text{true}}) \, dx}_{g(\theta)},$$

where $f(x; \theta_{\text{true}})$ denotes the true distribution of the samples $x_n$’s. From these two results 
we define two functions: 

$$g_N(\theta) \overset{\text{def}}{=} \frac{1}{N} \sum_{n=1}^{N} \log f(x_n; \theta), \quad \text{and} \quad g(\theta) \overset{\text{def}}{=} \int \log f(x; \theta) \cdot f(x; \theta_{\text{true}}) \, dx,$$

and we know that $g_N(\theta) \overset{p}{\longrightarrow} g(\theta)$. 
We also know that $\widehat{\theta}_{\text{ML}}$ is the ML estimator, and so 

$$\widehat{\theta}_{\text{ML}} = \underset{\theta}{\text{argmax}} \;\; g_N(\theta).$$

Let $\theta^*$ be the maximizer of the limiting function, i.e., 

$$\theta^* = \underset{\theta}{\text{argmax}} \;\; g(\theta).$$

Because $g_N(\theta) \overset{p}{\longrightarrow} g(\theta)$, we can (loosely³) argue that $\widehat{\theta}_{\text{ML}} \overset{p}{\longrightarrow} \theta^*$. If we can show that 
$\theta^* = \theta_{\text{true}}$, then we have shown that $\widehat{\theta}_{\text{ML}} \overset{p}{\longrightarrow} \theta_{\text{true}}$, implying that $\widehat{\theta}_{\text{ML}}$ is consistent. 


³To rigorously prove this statement we need some kind of regularity conditions on $g_N$ and $g$. A more 
formal proof can be found in H. Vincent Poor, An Introduction to Signal Detection and Estimation, Springer, 
1998, Section IV.D. 
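To show that $\theta^* = \theta_{\text{true}}$, we note that 

$$\frac{d}{d\theta} \int \log f(x; \theta) \cdot f(x; \theta_{\text{true}}) \, dx = \int \left\{ \frac{d}{d\theta} \log f(x; \theta) \right\} \cdot f(x; \theta_{\text{true}}) \, dx$$
$$= \int \frac{f'(x; \theta)}{f(x; \theta)} \cdot f(x; \theta_{\text{true}}) \, dx,$$

where $f'(x; \theta)$ denotes the derivative of $f(x; \theta)$ with respect to $\theta$. 
We ask whether this is equal to zero. Putting $\theta = \theta_{\text{true}}$, we have that 

$$\int \frac{f'(x; \theta_{\text{true}})}{f(x; \theta_{\text{true}})} \cdot f(x; \theta_{\text{true}}) \, dx = \int f'(x; \theta_{\text{true}}) \, dx.$$

However, this integral can be simplified to 

$$\int f'(x; \theta_{\text{true}}) \, dx = \frac{d}{d\theta} \underbrace{\int f(x; \theta) \, dx}_{=1} \, \Bigg|_{\theta = \theta_{\text{true}}} = 0.$$

Therefore, $\theta_{\text{true}}$ is a stationary point of $g(\theta)$; under the regularity conditions mentioned above 
it is the maximizer of $g(\theta)$, and so $\theta_{\text{true}} = \theta^*$. 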


End of the discussion. Please join us again. 
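The key step, that the limiting function $g(\theta)$ peaks at $\theta_{\text{true}}$, can be checked for a concrete model. The sketch below assumes a Bernoulli distribution, which is our choice for illustration and not part of the discussion above; in that case the integral becomes a two-term sum, $g(\theta) = \theta_{\text{true}} \log \theta + (1 - \theta_{\text{true}}) \log(1 - \theta)$.

```python
import math

theta_true = 0.3                    # assumed true Bernoulli parameter

def g(theta):
    """E[log f(X; theta)] when X ~ Bernoulli(theta_true)."""
    return theta_true * math.log(theta) + (1 - theta_true) * math.log(1 - theta)

grid = [k / 1000 for k in range(1, 1000)]   # candidate values of theta in (0, 1)
theta_star = max(grid, key=g)               # maximizer of the limiting function

print(theta_star)
```

For this model the grid maximizer coincides with `theta_true`, consistent with the claim $\theta^* = \theta_{\text{true}}$.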


8.2.4 Invariance principle 


Another useful property satisfied by the ML estimate is the invariance principle. The invariance 
principle says that a monotonic transformation of the true parameter is preserved 
for the ML estimates. 


What is the invariance principle? 


• There is a monotonic function $h$. 
• There is an ML estimate $\widehat{\theta}_{\text{ML}}$ for $\theta$. 
• The monotonic function $h$ maps the true parameter $\theta \mapsto h(\theta)$. 
• Then the same function will map the ML estimate $\widehat{\theta}_{\text{ML}} \mapsto h(\widehat{\theta}_{\text{ML}})$. 


The formal statement of the invariance principle is given by the theorem below. 


Theorem 8.1. If $\widehat{\theta}_{\text{ML}}$ is the ML estimate of $\theta$, then for any one-to-one function $h$ 
of $\theta$, the ML estimate of $h(\theta)$ is $h(\widehat{\theta}_{\text{ML}})$. 


Proof. Define the likelihood function $\mathcal{L}(\theta)$ (we have dropped $\boldsymbol{x}$ to simplify the notation). 
Then, for any monotonic function $h$, we have that 

$$\mathcal{L}(\theta) = \mathcal{L}(h^{-1}(h(\theta))).$$

Let $\widehat{\theta}_{\text{ML}}$ be the ML estimate: 

$$\widehat{\theta}_{\text{ML}} = \underset{\theta}{\text{argmax}} \;\; \mathcal{L}(\theta) = \underset{\theta}{\text{argmax}} \;\; \mathcal{L}(h^{-1}(h(\theta))).$$

By the definition of ML, $\widehat{\theta}_{\text{ML}}$ must maximize the likelihood. Therefore, $\mathcal{L}(h^{-1}(h(\theta)))$ is 
maximized when $h^{-1}(h(\theta)) = \widehat{\theta}_{\text{ML}}$. This implies that $h(\theta) = h(\widehat{\theta}_{\text{ML}})$ because $h$ is monotonic. 
Since $h(\theta)$ is the parameter we try to estimate, the equality $h(\theta) = h(\widehat{\theta}_{\text{ML}})$ implies that 
$h(\widehat{\theta}_{\text{ML}})$ is the ML estimate of $h(\theta)$. 

Example 8.18. Consider the single-photon image sensor example we discussed in 
Section 8.1. We consider a set of i.i.d. Bernoulli random variables with PMF 


px,(1)=1-e" and px,(0)=e". (8.34) 


Find the ML estimate through (a) direct calculation and (b) the invariance principle. 


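Solution. (a) Following the example in Equation (8.12), the ML estimate of $\eta$ is

$$\hat{\eta}_{\text{ML}} = \underset{\eta}{\text{argmax}} \prod_{n=1}^{N} \left(1 - e^{-\eta}\right)^{x_n} \left(e^{-\eta}\right)^{1-x_n} = -\log\left(1 - \frac{1}{N}\sum_{n=1}^{N} x_n\right).$$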


(b) We can obtain the same result using the invariance principle. Since $X_n$ is a binary random variable, we assume that it is a Bernoulli with parameter $\theta$. Then the ML estimate of $\theta$ is

$$\hat{\theta}_{\text{ML}} = \underset{\theta}{\text{argmax}} \prod_{n=1}^{N} \theta^{x_n} (1-\theta)^{1-x_n} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$


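The relationship between $\theta$ and $\eta$ is that $\theta = 1 - e^{-\eta}$, or $\eta = -\log(1-\theta)$. So we let $h(\theta) = -\log(1-\theta)$. The invariance principle says that the ML estimate of $h(\theta)$ is

$$\hat{\eta}_{\text{ML}} \overset{(i)}{=} h(\hat{\theta}_{\text{ML}}) = -\log\left(1 - \hat{\theta}_{\text{ML}}\right) = -\log\left(1 - \frac{1}{N}\sum_{n=1}^{N} x_n\right),$$

where (i) follows from the invariance principle.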


The invariance principle can be very convenient, especially when the transformation $h$ is complicated, so that a direct evaluation of the ML estimate is difficult.

[Figure 8.14: plots of the truncated Poisson log-likelihood $\log\mathcal{L}(\eta|S)$ and the Bernoulli log-likelihood $\log\mathcal{L}(\theta|S)$, linked by the transformation $h$.]

Figure 8.14: The invariance principle is a transformation of the ML estimate. In this example, we consider a Bernoulli log-likelihood function shown in the lowermost plot. For this log-likelihood, the ML estimate is $\hat{\theta}_{\text{ML}} = 0.4$. On the left-hand side we show another log-likelihood, derived for a truncated Poisson random variable. Note that the ML estimate is $\hat{\eta}_{\text{ML}} = 0.5108$. The invariance principle asserts that, instead of computing these ML estimates directly, we can first derive the relationship between $\eta$ and $\theta$ for any $\theta$. Since we know that $\theta = 1 - e^{-\eta}$, it follows that $\eta = -\log(1-\theta)$. We define this transformation as $\eta = h(\theta) = -\log(1-\theta)$. Then the ML estimate is $\hat{\eta}_{\text{ML}} = h(\hat{\theta}_{\text{ML}}) = h(0.4) = 0.5108$. The invariance principle saves us the trouble of computing the maximization of the more complicated truncated Poisson likelihood.


The invariance principle is portrayed in Figure 8.14. We start with the Bernoulli log-likelihood

$$\log \mathcal{L}(\theta|S) = S\log\theta + (N-S)\log(1-\theta).$$

In this particular example we let $S = 20$, where $S$ denotes the sum of the $N = 50$ Bernoulli random variables. The other log-likelihood is the truncated Poisson, which is given by

$$\log \mathcal{L}(\eta|S) = S\log(1 - e^{-\eta}) + (N-S)\log(e^{-\eta}).$$

The transformation between the two is the function

$$\eta = h(\theta) = -\log(1-\theta).$$

Putting everything into the figure, we see that the ML estimate ($\hat{\theta}_{\text{ML}} = 0.4$) is translated to $\hat{\eta}_{\text{ML}} = 0.5108$. The invariance principle asserts that this calculation can be done by $\hat{\eta}_{\text{ML}} = h(\hat{\theta}_{\text{ML}}) = h(0.4) = 0.5108$.
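The numbers in this example can be checked numerically. The sketch below is only an illustration: it assumes the log-likelihoods count $S$ ones and $N - S$ zeros, and it maximizes both over dense grids whose ranges are arbitrary choices. It then compares the directly maximized $\eta$ with the value obtained by passing the Bernoulli estimate through $h(\theta) = -\log(1-\theta)$.

```python
import numpy as np

N, S = 50, 20  # number of Bernoulli samples and their sum (values used for Figure 8.14)

def bernoulli_loglik(theta):
    """Log-likelihood of theta given S ones and N - S zeros."""
    return S * np.log(theta) + (N - S) * np.log(1.0 - theta)

def truncated_poisson_loglik(eta):
    """Same data, parameterized by eta, where theta = 1 - exp(-eta)."""
    return S * np.log(1.0 - np.exp(-eta)) + (N - S) * (-eta)

# Direct maximization over dense grids (the grid ranges are arbitrary choices).
theta_grid = np.linspace(1e-4, 1.0 - 1e-4, 200_001)
eta_grid = np.linspace(1e-4, 5.0, 200_001)
theta_ml = theta_grid[np.argmax(bernoulli_loglik(theta_grid))]
eta_ml_direct = eta_grid[np.argmax(truncated_poisson_loglik(eta_grid))]

# Invariance principle: transform the Bernoulli estimate with h(theta) = -log(1 - theta).
eta_ml_invariance = -np.log(1.0 - theta_ml)

print(theta_ml, eta_ml_direct, eta_ml_invariance)
```

Both routes should agree up to the grid resolution, which is the content of the invariance principle.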
8.3 Maximum A Posteriori Estimation


In ML estimation, the parameter $\theta$ is treated as a deterministic quantity. There are, however, many situations where we have some prior knowledge about $\theta$. For example, we may not know exactly the speed of a car, but we may know that the speed is roughly 65 mph with a standard deviation of 5 mph. How do we incorporate such prior knowledge into the estimation problem?

In this section, we introduce the second estimation technique, known as the maximum a posteriori (MAP) estimation. MAP estimation links the likelihood and the prior. The key idea is to treat the parameter $\theta$ as a random variable (vector) $\Theta$ with a PDF $f_{\Theta}(\theta)$.


8.3.1 The trio of likelihood, prior, and posterior 


To understand how the MAP estimation works, it is important first to understand the role of the parameter $\theta$, which changes from a deterministic quantity to a random quantity. Recall the likelihood function we defined in the ML estimation; it is

$$\mathcal{L}(\theta|\boldsymbol{x}) = f_{\boldsymbol{X}}(\boldsymbol{x};\theta),$$

if we assume that we have a set of i.i.d. observations $\boldsymbol{x} = [x_1,\ldots,x_N]^T$. By writing the PDF of $\boldsymbol{X}$ as $f_{\boldsymbol{X}}(\boldsymbol{x};\theta)$, we emphasize that $\theta$ is a deterministic but unknown parameter. There is nothing random about $\theta$.

In MAP, we change the nature of $\theta$ from deterministic to random. We replace $\theta$ by $\Theta$ and write

$$f_{\boldsymbol{X}}(\boldsymbol{x};\theta) \quad\overset{\text{becomes}}{\Longrightarrow}\quad f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta). \tag{8.35}$$

The difference between the left-hand side and the right-hand side is subtle but important. On the left-hand side, $f_{\boldsymbol{X}}(\boldsymbol{x};\theta)$ is the PDF of $\boldsymbol{X}$. This PDF is parameterized by $\theta$. On the right-hand side, $f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)$ is a conditional PDF of $\boldsymbol{X}$ given $\Theta$. The values they provide are exactly the same. However, in $f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)$, $\theta$ is a realization of a random variable $\Theta$.

Because $\Theta$ is now a random variable (vector), we can define its PDF (yes, the PDF of $\Theta$), and denote it by

$$f_{\Theta}(\theta), \tag{8.36}$$

which is called the prior distribution. The prior distribution of $\Theta$ is unique in MAP estimation. There is nothing called a prior in ML estimation.

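Multiplying $f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)$ with the prior PDF $f_{\Theta}(\theta)$, and using Bayes' Theorem, we obtain the posterior distribution:

$$f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x}) = \frac{f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)\, f_{\Theta}(\theta)}{f_{\boldsymbol{X}}(\boldsymbol{x})}. \tag{8.37}$$

The posterior distribution is the PDF of $\Theta$ given the measurements $\boldsymbol{X}$.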
The likelihood, the prior, and the posterior can be confusing. Let us clarify their meanings.

• Likelihood $f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)$: This is the conditional probability density of $\boldsymbol{X}$ given the parameter $\Theta$. Do not confuse the likelihood $f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)$ defined in the MAP context and the likelihood $f_{\boldsymbol{X}}(\boldsymbol{x};\theta)$ defined in the ML context. The former assumes that $\Theta$ is random whereas the latter assumes that $\theta$ is deterministic. They have the same values.

• Prior $f_{\Theta}(\theta)$: This is the prior distribution of $\Theta$. It does not come from the data $\boldsymbol{X}$ but from our prior knowledge. For example, if we see a bike on the road, even before we take any measurement we will have a rough idea of its speed. This is the prior distribution.

• Posterior $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$: This is the posterior density of $\Theta$ given that we have observed $\boldsymbol{X}$. Do not confuse $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ and $\mathcal{L}(\theta|\boldsymbol{x})$. The posterior distribution $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ is a PDF of $\Theta$ given $\boldsymbol{X} = \boldsymbol{x}$. The likelihood $\mathcal{L}(\theta|\boldsymbol{x})$ is not a PDF. If you integrate $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ with respect to $\theta$, you get 1, but if you integrate $\mathcal{L}(\theta|\boldsymbol{x})$ with respect to $\theta$, you do not get 1.


What is the difference between ML and MAP? 


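Likelihood     ML    $f_{\boldsymbol{X}}(\boldsymbol{x};\theta)$           The parameter $\theta$ is deterministic.
               MAP   $f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)$    The parameter $\Theta$ is random.

Prior          ML    (none)                                                There is no prior, because $\theta$ is deterministic.
               MAP   $f_{\Theta}(\theta)$                                  This is the PDF of $\Theta$.

Optimization   ML    Find the peak of the likelihood $f_{\boldsymbol{X}}(\boldsymbol{x};\theta)$.
               MAP   Find the peak of the posterior $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$.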


Maximum a posteriori (MAP) estimation is a form of Bayesian estimation. Bayesian 
methods emphasize our prior knowledge or beliefs about the parameters. As we will see 
shortly, the prior has something valuable to offer, especially when we have very few data 
points. 


8.3.2 Understanding the priors 


Since the biggest difference between MAP and ML is the addition of the prior $f_{\Theta}(\theta)$, we need to take a closer look at what priors mean. In Figure 8.15 below, we show a set of six different priors. We ask two questions: (1) What do they mean? (2) Which one should we use?


What does the shape of a prior tell us? 


It tells us your belief as to how the underlying parameter $\Theta$ should be distributed.

[Figure 8.15: six panels, (a) through (f), each showing one prior $f_{\Theta}(\theta)$.]

Figure 8.15: This figure illustrates six different examples of the prior distribution $f_{\Theta}(\theta)$, when the parameter $\theta$ is one-dimensional. The prior distribution $f_{\Theta}(\theta)$ is the PDF of $\Theta$. (a) $f_{\Theta}(\theta) = \delta(\theta)$, which is a delta function. (b) $f_{\Theta}(\theta) = \frac{1}{b-a}$ for $a \le \theta \le b$. This is a uniform distribution. (c) This is also a uniform distribution, but the spread is very wide. (d) $f_{\Theta}(\theta) = \text{Gaussian}(0, \sigma^2)$, which is a zero-mean Gaussian. (e) The same Gaussian, but with a different mean. (f) A Gaussian with zero mean, but a large variance.


The meaning of this statement can be best understood from the examples shown in Fig- 
ure 8.15: 


• Figure 8.15(a). This is a delta prior $f_{\Theta}(\theta) = \delta(\theta)$ (or $f_{\Theta}(\theta) = \delta(\theta - \theta_0)$). If you use this prior, you are absolutely sure that the parameter $\Theta$ takes a specific value. There is no uncertainty about your belief. Since you are so confident about your prior knowledge, you will ignore the likelihood that is constructed from the data. No one will use a delta prior in practice.

• Figure 8.15(b). $f_{\Theta}(\theta) = \frac{1}{b-a}$ for $a \le \theta \le b$, and is zero otherwise. This is a bounded uniform prior. You do not have any preference for the parameter $\Theta$, but you do know from your prior experience that $a \le \Theta \le b$.

• Figure 8.15(c). This prior is the same as (b) but is short and very wide. If you use this prior, it means that you know nothing about the parameter. So you give up the prior and let the likelihood dominate the MAP estimate.

• Figure 8.15(d). $f_{\Theta}(\theta) = \text{Gaussian}(0, \sigma^2)$. You use this prior when you know something about the parameter, e.g., that it is centered at a certain location and you have some uncertainty.

• Figure 8.15(e). Same as (d), but the parameter is centered at some other location.

• Figure 8.15(f). Same as (d), but you have less confidence about the parameter.


As you can see from these examples, the shape of the prior tells us how you want $\Theta$ to be distributed. The choice you make will directly influence the MAP optimization, and hence the MAP estimate.

Since the prior is a subjective quantity in the MAP framework, you as the user have the freedom to choose whatever you like. For instance, if you have conducted a similar experiment before, you can use the results of the previous experiments as the current prior. Another strategy is to go with physics. For instance, we can argue that $\theta$ should be sparse so that it contains as few non-zeros as possible. In this case, a sparsity-driven prior, such as $f_{\Theta}(\theta) = \exp\{-\|\theta\|_1\}$, could be a choice. The third strategy is to choose a prior that is computationally "friendlier", e.g., in quadratic form so that the MAP is differentiable. One such choice is the conjugate prior. We will discuss this later in Section 8.3.6.

Which prior should we choose?


• Based on your preference, e.g., you know from historical data that the parameter should behave in certain ways.

• Based on physics, e.g., the parameter has a physical interpretation, so you need to abide by the physical laws.

• Choose a prior that is computationally "friendlier". This is the topic of the conjugate prior, which is a prior that does not change the form of the posterior distribution. (We will discuss this later in Section 8.3.6.)


8.3.3 MAP formulation and solution


Our next task is to study how to formulate the MAP problem and how to solve it. 


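Definition 8.7. Let $\boldsymbol{X} = [X_1,\ldots,X_N]^T$ be i.i.d. observations. Let $\Theta$ be a random parameter. The maximum-a-posteriori estimate of $\Theta$ is

$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}}\ f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x}). \tag{8.38}$$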


Philosophically speaking, ML and MAP have two different goals. ML considers a parametric model with a deterministic parameter. Its goal is to find the parameter that maximizes the likelihood for the data we have observed. MAP also considers a parametric model but the parameter $\Theta$ is random. Because $\Theta$ is random, we are finding one particular state $\theta$ of the parameter $\Theta$ that offers the best explanation conditioned on the data $\boldsymbol{X}$ we observe. In a sense, the two optimization problems are

$$\hat{\theta}_{\text{ML}} = \underset{\theta}{\text{argmax}}\ f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta),$$
$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}}\ f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x}).$$

This pair of equations is interesting, as the pair tells us that the difference between the ML estimation and the MAP estimation is the flipped order of $\boldsymbol{X}$ and $\Theta$.

There are two reasons we care about the posterior. First, in MAP the posterior allows us to incorporate the prior. ML does not allow a prior. A prior can be useful when the number of samples is small. Second, maximizing the posterior does have some physical interpretations. MAP asks for the probability of $\Theta = \theta$ after observing $N$ training samples $\boldsymbol{X} = \boldsymbol{x}$. ML asks for the probability of observing $\boldsymbol{X} = \boldsymbol{x}$ given a parameter $\theta$. Both are correct and legitimate criteria, but sometimes we might prefer one over the other.

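To solve the MAP problem, we notice that

$$\begin{aligned}
\hat{\theta}_{\text{MAP}} &= \underset{\theta}{\text{argmax}}\ f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x}) \\
&= \underset{\theta}{\text{argmax}}\ \frac{f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)\, f_{\Theta}(\theta)}{f_{\boldsymbol{X}}(\boldsymbol{x})} \\
&= \underset{\theta}{\text{argmax}}\ f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)\, f_{\Theta}(\theta), \qquad f_{\boldsymbol{X}}(\boldsymbol{x}) \text{ does not contain } \theta \\
&= \underset{\theta}{\text{argmax}}\ \log f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta) + \log f_{\Theta}(\theta).
\end{aligned}$$

Therefore, what MAP adds is the prior $\log f_{\Theta}(\theta)$. If you use an uninformative prior, e.g., a prior with extremely wide support, then the MAP estimation will return more or less the same result as the ML estimation.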


When does MAP = ML? 


• The relation "=" does not make sense here, because $\theta$ is random in MAP but deterministic in ML.

• Solution of MAP optimization = solution of ML optimization, when $f_{\Theta}(\theta)$ is uniform over the parameter space.

• In this case, $f_{\Theta}(\theta)$ = constant and so it can be dropped from the optimization.


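Example 8.19. Let $X_1,\ldots,X_N$ be i.i.d. random variables with a PDF $f_{X_n|\Theta}(x_n|\theta)$ for all $n$, and $\Theta$ be a random parameter with PDF $f_{\Theta}(\theta)$:

$$f_{X_n|\Theta}(x_n|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x_n-\theta)^2}{2\sigma^2}\right\},$$
$$f_{\Theta}(\theta) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right\}.$$

Find the MAP estimate.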
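Solution. The MAP estimate is

$$\begin{aligned}
\hat{\theta}_{\text{MAP}} &= \underset{\theta}{\text{argmax}} \left[\prod_{n=1}^{N} f_{X_n|\Theta}(x_n|\theta)\right] f_{\Theta}(\theta) \\
&= \underset{\theta}{\text{argmax}} \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{N} \exp\left\{-\sum_{n=1}^{N}\frac{(x_n-\theta)^2}{2\sigma^2}\right\} \times \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right\}.
\end{aligned}$$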

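Since the maximizer is not changed by any monotonic function, we apply the logarithm to the above equations. This yields

$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}} \left\{ -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2}\log(2\pi\sigma_0^2) - \sum_{n=1}^{N}\frac{(x_n-\theta)^2}{2\sigma^2} - \frac{(\theta-\mu_0)^2}{2\sigma_0^2} \right\}.$$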

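Constants in the maximization do not matter. So by dropping the constant terms we obtain

$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}} \left\{ -\sum_{n=1}^{N}\frac{(x_n-\theta)^2}{2\sigma^2} - \frac{(\theta-\mu_0)^2}{2\sigma_0^2} \right\}. \tag{8.39}$$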

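It now remains to solve the maximization. To this end we take the derivative w.r.t. $\theta$ and show that

$$\frac{d}{d\theta}\left\{ -\sum_{n=1}^{N}\frac{(x_n-\theta)^2}{2\sigma^2} - \frac{(\theta-\mu_0)^2}{2\sigma_0^2} \right\} = 0.$$

This yields

$$\sum_{n=1}^{N}\frac{x_n-\theta}{\sigma^2} - \frac{\theta-\mu_0}{\sigma_0^2} = 0.$$

Rearranging the terms gives us the final result:

$$\hat{\theta}_{\text{MAP}} = \frac{\sigma_0^2\left(\frac{1}{N}\sum_{n=1}^{N} x_n\right) + \frac{\sigma^2}{N}\mu_0}{\sigma_0^2 + \frac{\sigma^2}{N}}.$$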
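As a sanity check on the closed-form result, the following sketch compares it with a brute-force maximization of the log-posterior in Equation (8.39). The data, the true mean, the prior parameters, and the grid are all illustrative assumptions and do not come from the example.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma = 1.0, 1.0     # hypothetical true mean and known noise level
mu0, sigma0 = 0.0, 0.5           # hypothetical prior mean and standard deviation
N = 10
x = rng.normal(theta_true, sigma, size=N)

def log_posterior(theta):
    """Log-posterior up to a constant, evaluated on an array of candidate thetas."""
    data_term = -np.sum((x[:, None] - theta[None, :]) ** 2, axis=0) / (2 * sigma**2)
    prior_term = -(theta - mu0) ** 2 / (2 * sigma0**2)
    return data_term + prior_term

# Closed-form MAP estimate from the example.
theta_ml = x.mean()
theta_map = (sigma0**2 * theta_ml + (sigma**2 / N) * mu0) / (sigma0**2 + sigma**2 / N)

# Brute-force maximization on a grid (the grid range is an arbitrary choice).
grid = np.linspace(-3.0, 3.0, 60_001)
theta_map_grid = grid[np.argmax(log_posterior(grid))]

print(theta_ml, theta_map, theta_map_grid)
```

The closed-form value is a weighted average of the sample mean and the prior mean, so it always lies between the two.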


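Practice Exercise 8.7. Prove that if $f_{\Theta}(\theta) = \delta(\theta - \theta_0)$, the MAP estimate is $\hat{\theta}_{\text{MAP}} = \theta_0$.


Solution. If $f_{\Theta}(\theta) = \delta(\theta - \theta_0)$, then

$$\begin{aligned}
\hat{\theta}_{\text{MAP}} &= \underset{\theta}{\text{argmax}}\ \log f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta) + \log f_{\Theta}(\theta) \\
&= \underset{\theta}{\text{argmax}}\ \log f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta) + \log \delta(\theta - \theta_0) \\
&= \begin{cases}
\underset{\theta}{\text{argmax}}\ \log f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta) - \infty, & \theta \neq \theta_0, \\
\underset{\theta}{\text{argmax}}\ \log f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta) + 0, & \theta = \theta_0.
\end{cases}
\end{aligned}$$

Thus, if $\hat{\theta}_{\text{MAP}} \neq \theta_0$, the first case says that there is no solution, so we must go with the second case $\hat{\theta}_{\text{MAP}} = \theta_0$. But if $\hat{\theta}_{\text{MAP}} = \theta_0$, there is no optimization because we have already chosen $\hat{\theta}_{\text{MAP}} = \theta_0$. This proves the result.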


8.3.4 Analyzing the MAP solution 


As we said earlier, MAP offers something that ML does not. To see this, we will use the result of the Gaussian random variables as an example and analyze the MAP solution as we change the parameters $N$ and $\sigma_0$. Recall that if $X_1,\ldots,X_N$ are i.i.d. Gaussian random variables with unknown mean $\theta$ and known variance $\sigma^2$, the ML estimate is

$$\hat{\theta}_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$

Assuming that the parameter $\Theta$ is distributed according to a PDF $\text{Gaussian}(\mu_0, \sigma_0^2)$, we have shown in the previous subsection that

$$\hat{\theta}_{\text{MAP}} = \frac{\sigma_0^2\left(\frac{1}{N}\sum_{n=1}^{N} x_n\right) + \frac{\sigma^2}{N}\mu_0}{\sigma_0^2 + \frac{\sigma^2}{N}} = \frac{\sigma_0^2\,\hat{\theta}_{\text{ML}} + \frac{\sigma^2}{N}\mu_0}{\sigma_0^2 + \frac{\sigma^2}{N}}.$$

In what follows, we will take a look at the behavior of the MAP estimate $\hat{\theta}_{\text{MAP}}$ as $N$ and $\sigma_0$ change. The results of our discussion are summarized in Figure 8.16.


[Figure 8.16 panels: (a) Effect of $N$; (b) Effect of $\sigma_0$. Each panel places the MAP estimate $\hat{\theta}_{\text{MAP}}$ between the ML estimate $\hat{\theta}_{\text{ML}}$ and the prior mean $\mu_0$.]

Figure 8.16: The MAP estimate $\hat{\theta}_{\text{MAP}}$ swings between the ML estimate $\hat{\theta}_{\text{ML}}$ and the prior $\mu_0$. (a) When $N$ increases, the likelihood is more reliable and so we lean towards the ML estimate. If $N$ is small, we should trust the prior more than the ML estimate. (b) When $\sigma_0$ decreases, we become more confident about the prior and so we will use it. If $\sigma_0$ is large, we use more information from the ML estimate.


First, let’s look at the effect of N. 


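How does $N$ change $\hat{\theta}_{\text{MAP}}$?


• As $N \to \infty$, the MAP estimate $\hat{\theta}_{\text{MAP}} \to \hat{\theta}_{\text{ML}}$. If we have enough samples, we trust the data.

• As $N \to 0$, the MAP estimate $\hat{\theta}_{\text{MAP}} \to \mu_0$. If we do not have any samples, we trust the prior.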


These two results can be demonstrated by taking the limits. As $N \to \infty$, the MAP estimate converges to

$$\lim_{N\to\infty} \hat{\theta}_{\text{MAP}} = \lim_{N\to\infty} \frac{\sigma_0^2\,\hat{\theta}_{\text{ML}} + \frac{\sigma^2}{N}\mu_0}{\sigma_0^2 + \frac{\sigma^2}{N}} = \hat{\theta}_{\text{ML}}.$$

This result is not surprising. When we have infinitely many samples, we will completely rely on the data and make our estimate. Thus, the MAP estimate is the same as the ML estimate.

When $N \to 0$, the MAP estimate converges to

$$\lim_{N\to 0} \hat{\theta}_{\text{MAP}} = \lim_{N\to 0} \frac{\sigma_0^2\,\hat{\theta}_{\text{ML}} + \frac{\sigma^2}{N}\mu_0}{\sigma_0^2 + \frac{\sigma^2}{N}} = \mu_0. \tag{8.42}$$

This means that, when we do not have any samples, the MAP estimate $\hat{\theta}_{\text{MAP}}$ will completely use the prior distribution, which has a mean $\mu_0$.

The implication of this result is that MAP offers a natural swing between $\hat{\theta}_{\text{ML}}$ and $\mu_0$, controlled by $N$. Where does this $N$ come from? If we recall the derivation of the result, we note that the $N$ affects the likelihood term through the number of samples:

$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}} \Bigg\{ -\underbrace{\sum_{n=1}^{N}\frac{(x_n-\theta)^2}{2\sigma^2}}_{N \text{ terms here}} - \underbrace{\frac{(\theta-\mu_0)^2}{2\sigma_0^2}}_{1 \text{ term}} \Bigg\}.$$

Thus, as $N$ increases, the influence of the data term grows, and so the result will gradually shift towards $\hat{\theta}_{\text{ML}}$.

Figure 8.17 illustrates a numerical experiment in which we draw $N$ random samples $x_1,\ldots,x_N$ according to a Gaussian distribution $\text{Gaussian}(\theta, \sigma^2)$, with $\sigma = 1$. We assume that the prior distribution is $\text{Gaussian}(\mu_0, \sigma_0^2)$, with $\mu_0 = 0$ and $\sigma_0 = 0.25$. The ML estimate of this problem is $\hat{\theta}_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n$, whereas the MAP estimate is given by Equation (8.40). The figure shows the resulting PDFs. A helpful analogy is that the prior and the likelihood are pulling a rope in two opposite directions. As $N$ grows, the force of the likelihood increases and so the influence becomes stronger.


[Figure 8.17 panels: (a) $N = 1$; (b) $N = 50$. Each panel shows the prior, the likelihood, and the MAP estimate.]

Figure 8.17: The subfigures show the prior distribution $f_{\Theta}(\theta)$ and the likelihood function $f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)$, given the observed data. (a) When $N = 1$, the estimated posterior distribution $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ is pulled towards the prior. (b) When $N = 50$, the posterior is pulled towards the ML estimate. The analogy for the situation is that each data point is acting as a small force against the big force of the prior. As $N$ grows, the small forces of the data points accumulate and eventually dominate.
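An experiment in the spirit of Figure 8.17 can be sketched in a few lines. The true mean `theta_true = 5.0` below is a hypothetical choice (the text does not state it), and the helper names `map_estimate` and `prior_weight` are introduced only for this illustration. The sketch rewrites the MAP estimate as a weighted average of $\hat{\theta}_{\text{ML}}$ and $\mu_0$ and prints how the weight on the prior shrinks as $N$ grows.

```python
import numpy as np

def map_estimate(x, sigma, mu0, sigma0):
    """Closed-form MAP estimate of a Gaussian mean under a Gaussian prior."""
    n = len(x)
    theta_ml = np.mean(x)
    return (sigma0**2 * theta_ml + (sigma**2 / n) * mu0) / (sigma0**2 + sigma**2 / n)

def prior_weight(n, sigma, sigma0):
    """Weight placed on mu0 when the MAP estimate is written as a convex combination."""
    return (sigma**2 / n) / (sigma0**2 + sigma**2 / n)

rng = np.random.default_rng(1)
theta_true = 5.0                      # hypothetical true mean, not given in the text
sigma, mu0, sigma0 = 1.0, 0.0, 0.25   # values quoted for the Figure 8.17 experiment
samples = rng.normal(theta_true, sigma, size=1000)

for n in (1, 10, 50, 1000):
    x = samples[:n]
    print(n, np.mean(x), map_estimate(x, sigma, mu0, sigma0), prior_weight(n, sigma, sigma0))
```

With these settings the prior carries most of the weight at $N = 1$ and very little at $N = 1000$, which matches the rope-pulling picture.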


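We next look at the effect of $\sigma_0$.


How does $\sigma_0$ change $\hat{\theta}_{\text{MAP}}$?


• As $\sigma_0 \to \infty$, the MAP estimate $\hat{\theta}_{\text{MAP}} \to \hat{\theta}_{\text{ML}}$. If we have doubts about the prior, we trust the data.

• As $\sigma_0 \to 0$, the MAP estimate $\hat{\theta}_{\text{MAP}} \to \mu_0$. If we are absolutely sure about the prior, we ignore the data.

When $\sigma_0 \to \infty$, the limit of $\hat{\theta}_{\text{MAP}}$ is

$$\lim_{\sigma_0\to\infty} \hat{\theta}_{\text{MAP}} = \lim_{\sigma_0\to\infty} \frac{\sigma_0^2\,\hat{\theta}_{\text{ML}} + \frac{\sigma^2}{N}\mu_0}{\sigma_0^2 + \frac{\sigma^2}{N}} = \hat{\theta}_{\text{ML}}. \tag{8.43}$$

The reason why this happens is that $\sigma_0$ is the uncertainty level of the prior. If $\sigma_0$ is high, we are not certain about the prior. In this case, MAP chooses to follow the ML estimate. When $\sigma_0 \to 0$, the limit of $\hat{\theta}_{\text{MAP}}$ is

$$\lim_{\sigma_0\to 0} \hat{\theta}_{\text{MAP}} = \lim_{\sigma_0\to 0} \frac{\sigma_0^2\,\hat{\theta}_{\text{ML}} + \frac{\sigma^2}{N}\mu_0}{\sigma_0^2 + \frac{\sigma^2}{N}} = \mu_0. \tag{8.44}$$

Note that when $\sigma_0 \to 0$, we are essentially saying that we are absolutely sure about the prior. If we are so sure about the prior, there is no need to look at the data. In that case the MAP estimate is $\mu_0$.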

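The way to understand the influence of $\sigma_0$ is to inspect the equation:

$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}} \Bigg\{ -\underbrace{\sum_{n=1}^{N}\frac{(x_n-\theta)^2}{2\sigma^2}}_{\text{fixed w.r.t. } \sigma_0} - \underbrace{\frac{(\theta-\mu_0)^2}{2\sigma_0^2}}_{\text{changes with } \sigma_0} \Bigg\}.$$

Since $\sigma_0$ is purely a preference you decide, you can control how much trust to put onto the prior.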


[Figure 8.18 panels: (a) $\sigma_0 = 0.1$; (b) $\sigma_0 = 1$. Each panel shows the prior and the likelihood.]

Figure 8.18: The subfigures show the prior distribution $f_{\Theta}(\theta)$ and the likelihood function $f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)$, given the observed data. (a) When $\sigma_0 = 0.1$, the estimated posterior distribution $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ is pulled towards the prior. (b) When $\sigma_0 = 1$, the posterior is pulled towards the ML estimate. An analogy for the situation is that the strength of the prior depends on the magnitude of $\sigma_0$. If $\sigma_0$ is small the prior is strong, and so the influence is large. If $\sigma_0$ is large the prior is weak, and so the ML estimate will dominate.


Figure 8.18 illustrates a numerical experiment in which we compare $\sigma_0 = 0.1$ and $\sigma_0 = 1$. If $\sigma_0$ is small, the prior distribution $f_{\Theta}(\theta)$ becomes similar to a delta function. We can interpret it as a very confident prior, so confident that we wish to align with the prior. The situation can be imagined as a game of tug-of-war between a powerful bull and a horse, which the bull will naturally win. If $\sigma_0$ is large the prior distribution will become flat. It means that we are not very confident about the prior so that we will trust the data. In this case the MAP estimate will shift towards the ML estimate.
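The same swing can be seen by sweeping $\sigma_0$ while keeping the data fixed. In the sketch below, the values `theta_ml = 2.0`, `mu0 = 0.0`, `N = 5`, and the list of $\sigma_0$ values are hypothetical numbers chosen only to make the two limits visible.

```python
import numpy as np

N, sigma = 5, 1.0
theta_ml, mu0 = 2.0, 0.0    # hypothetical ML estimate and prior mean

# Sweep the prior standard deviation from very confident to very vague.
sigma0_values = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
theta_map = (sigma0_values**2 * theta_ml + (sigma**2 / N) * mu0) / (
    sigma0_values**2 + sigma**2 / N
)

for s0, t in zip(sigma0_values, theta_map):
    print(f"sigma0 = {s0:7.2f}  ->  theta_MAP = {t:.4f}")
```

For the smallest $\sigma_0$ the estimate sits near $\mu_0$, and for the largest it sits near $\hat{\theta}_{\text{ML}}$.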


8.3.5 Analysis of the posterior distribution 


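When the likelihood is multiplied with the prior to form the posterior, what does the posterior distribution look like? To answer this question we continue our Gaussian example with a fixed variance $\sigma^2$ and an unknown mean $\theta$. The posterior distribution is proportional to

$$\begin{aligned}
f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x}) &\propto f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)\, f_{\Theta}(\theta) \\
&= \left[\prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x_n-\theta)^2}{2\sigma^2}\right\}\right] \cdot \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right\}.
\end{aligned} \tag{8.45}$$

Performing the multiplication and completing the squares,

$$\sum_{n=1}^{N}\frac{(x_n-\theta)^2}{2\sigma^2} + \frac{(\theta-\mu_0)^2}{2\sigma_0^2} = \frac{(\theta-\hat{\theta}_{\text{MAP}})^2}{2\hat{\sigma}_{\text{MAP}}^2} + C,$$

where $C$ does not depend on $\theta$, and

$$\hat{\theta}_{\text{MAP}} = \frac{\sigma_0^2\,\hat{\theta}_{\text{ML}} + \frac{\sigma^2}{N}\mu_0}{\sigma_0^2 + \frac{\sigma^2}{N}}, \quad\text{and}\quad \frac{1}{\hat{\sigma}_{\text{MAP}}^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}. \tag{8.46}$$

In other words, the posterior distribution $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ is also a Gaussian with

$$f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x}) = \text{Gaussian}(\hat{\theta}_{\text{MAP}}, \hat{\sigma}_{\text{MAP}}^2).$$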


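If $f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta) = \text{Gaussian}(\boldsymbol{x};\theta,\sigma)$, and $f_{\Theta}(\theta) = \text{Gaussian}(\theta;\mu_0,\sigma_0)$, what is the posterior $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$?


The posterior $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ is $\text{Gaussian}(\hat{\theta}_{\text{MAP}}, \hat{\sigma}_{\text{MAP}}^2)$, where

$$\hat{\theta}_{\text{MAP}} = \frac{\sigma_0^2\,\hat{\theta}_{\text{ML}} + \frac{\sigma^2}{N}\mu_0}{\sigma_0^2 + \frac{\sigma^2}{N}}, \qquad \frac{1}{\hat{\sigma}_{\text{MAP}}^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}.$$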


The posterior tells us how $N$ and $\sigma_0$ will influence the MAP estimate. As $N$ grows, the posterior mean and variance become

$$\lim_{N\to\infty} \hat{\theta}_{\text{MAP}} = \hat{\theta}_{\text{ML}} = \theta, \quad\text{and}\quad \lim_{N\to\infty} \hat{\sigma}_{\text{MAP}} = 0.$$

As a result, the posterior distribution $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ will converge to a delta function centered at the ML estimate $\hat{\theta}_{\text{ML}}$. Therefore, as we try to solve the MAP problem by maximizing the posterior, the MAP estimate has to improve because $\hat{\sigma}_{\text{MAP}} \to 0$.

We can plot the posterior distribution $\text{Gaussian}(\hat{\theta}_{\text{MAP}}, \hat{\sigma}_{\text{MAP}}^2)$ as a function of the number of samples $N$. Figure 8.19 illustrates this example using the following configurations. The likelihood is Gaussian with $\mu = 1$, $\sigma = 0.25$. The prior is Gaussian with $\mu_0 = 0$ and $\sigma_0 = 0.25$. We construct the Gaussian according to $\text{Gaussian}(\hat{\theta}_{\text{MAP}}, \hat{\sigma}_{\text{MAP}}^2)$ by varying $N$. The result shown in Figure 8.19 confirms our prediction: As $N$ grows, the posterior becomes more like a delta function whose mean is the true mean $\mu$. The posterior estimator $\hat{\theta}_{\text{MAP}}$, for each $N$, is the peak of the respective Gaussian.
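The closed-form posterior mean and variance can be compared against a posterior that is normalized numerically. The sketch below uses the Figure 8.19 settings ($\mu = 1$, $\sigma = 0.25$, $\mu_0 = 0$, $\sigma_0 = 0.25$); the sample size $N = 5$, the random seed, and the grid are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 0.25        # likelihood parameters from the Figure 8.19 setup
mu0, sigma0 = 0.0, 0.25      # prior parameters from the same setup
N = 5                        # arbitrary sample size for this check
x = rng.normal(mu, sigma, size=N)

# Closed-form posterior parameters.
theta_ml = x.mean()
theta_map = (sigma0**2 * theta_ml + (sigma**2 / N) * mu0) / (sigma0**2 + sigma**2 / N)
var_map = 1.0 / (1.0 / sigma0**2 + N / sigma**2)

# Numerical posterior: likelihood times prior on a grid, then normalize.
grid = np.linspace(-2.0, 3.0, 50_001)
log_post = (-np.sum((x[:, None] - grid[None, :]) ** 2, axis=0) / (2 * sigma**2)
            - (grid - mu0) ** 2 / (2 * sigma0**2))
weights = np.exp(log_post - log_post.max())   # subtract the max for numerical stability
weights /= weights.sum()
mean_num = np.sum(weights * grid)
var_num = np.sum(weights * (grid - mean_num) ** 2)

print(theta_map, mean_num)
print(var_map, var_num)
```

Increasing `N` in this sketch shrinks `var_map` roughly like $1/N$, which is the narrowing seen in the figure.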


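[Figure 8.19: posterior PDFs plotted over $\theta$ for several values of $N$.]

Figure 8.19: Posterior distribution $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x}) = \text{Gaussian}(\hat{\theta}_{\text{MAP}}, \hat{\sigma}_{\text{MAP}}^2)$ as $N$ grows. When $N$ is small, the posterior distribution is dominated by the prior. As $N$ increases, the posterior distribution changes its mean and its variance.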


What is the pictorial interpretation of the MAP estimate? 
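• For every $N$, MAP has a posterior distribution $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$.

• As $N$ grows, $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ converges to a delta function centered at $\hat{\theta}_{\text{ML}}$.

• MAP tries to find the peak of $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$. For large $N$, it returns $\hat{\theta}_{\text{ML}}$.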


8.3.6 Conjugate prior 


Choosing the prior is an important topic in MAP estimation. We have elaborated two "engineering" solutions: Use your prior experience or follow the physics. In this subsection, we discuss the third option: to choose something computationally friendly. To explain what we mean by "computationally friendly", let us consider the following example, thanks to Avinash Kak.⁴

⁴ Avinash Kak, "ML, MAP, and Bayesian — The Holy Trinity of Parameter Estimation and Data Prediction", https://engineering.purdue.edu/kak/Tutorials/Trinity.pdf

Consider a Bernoulli distribution with a PMF

$$f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta) = \prod_{n=1}^{N} \theta^{x_n}(1-\theta)^{1-x_n}. \tag{8.48}$$

To compute the MAP estimate, we assume that we have a prior $f_{\Theta}(\theta)$. Therefore, the MAP estimate is given by

$$\begin{aligned}
\hat{\theta}_{\text{MAP}} &= \underset{\theta}{\text{argmax}}\ f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)\, f_{\Theta}(\theta) \\
&= \underset{\theta}{\text{argmax}} \left[\prod_{n=1}^{N} \theta^{x_n}(1-\theta)^{1-x_n}\right] \cdot f_{\Theta}(\theta) \\
&= \underset{\theta}{\text{argmax}} \sum_{n=1}^{N} \left\{x_n\log\theta + (1-x_n)\log(1-\theta)\right\} + \log f_{\Theta}(\theta).
\end{aligned}$$


Let us consider three options for the prior. Which one would you use?

Candidate 1: $f_{\Theta}(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(\theta-\mu)^2}{2\sigma^2}\right\}$, a Gaussian prior. If you choose this prior, the optimization problem will become

$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}} \sum_{n=1}^{N} \left\{x_n\log\theta + (1-x_n)\log(1-\theta)\right\} - \frac{(\theta-\mu)^2}{2\sigma^2}.$$

We can still take the derivative and set it to zero. This gives

$$\frac{\sum_{n=1}^{N} x_n}{\theta} - \frac{N - \sum_{n=1}^{N} x_n}{1-\theta} = \frac{\theta-\mu}{\sigma^2}.$$

Defining $S = \sum_{n=1}^{N} x_n$ and moving the terms around, we have

$$(1-\theta)\sigma^2 S - \theta\sigma^2(N-S) = \theta(1-\theta)(\theta-\mu).$$

This is a cubic polynomial problem that has a closed-form solution and is also solvable by a computer. But it's also tedious, at least to lazy engineers like ourselves.
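To see what "solvable by a computer" might look like, here is a minimal sketch that hands the cubic to a polynomial root finder. Expanding the equation $(1-\theta)\sigma^2 S - \theta\sigma^2(N-S) = \theta(1-\theta)(\theta-\mu)$ gives $\theta^3 - (1+\mu)\theta^2 + (\mu - \sigma^2 N)\theta + \sigma^2 S = 0$. The data ($N = 50$, $S = 20$) and the prior parameters ($\mu = 0.5$, $\sigma = 0.1$) are hypothetical values chosen for illustration, and the grid search is only a cross-check.

```python
import numpy as np

N, S = 50, 20              # hypothetical data: 20 ones out of 50 Bernoulli samples
mu, sigma = 0.5, 0.1       # hypothetical Gaussian prior mean and standard deviation
s2 = sigma**2

# Cubic obtained by expanding the stationarity condition:
# theta^3 - (1 + mu) theta^2 + (mu - s2*N) theta + s2*S = 0.
coeffs = [1.0, -(1.0 + mu), mu - s2 * N, s2 * S]
roots = np.roots(coeffs)

# Keep the real roots that are valid Bernoulli parameters.
real_roots = roots[np.abs(roots.imag) < 1e-9].real
theta_map = real_roots[(real_roots > 0.0) & (real_roots < 1.0)]

# Cross-check by maximizing the log-posterior on a grid.
grid = np.linspace(1e-4, 1.0 - 1e-4, 100_001)
objective = S * np.log(grid) + (N - S) * np.log(1.0 - grid) - (grid - mu) ** 2 / (2 * s2)
theta_grid = grid[np.argmax(objective)]

print(theta_map, theta_grid)
```

Because the log-posterior is strictly concave on $(0,1)$ when $0 < S < N$, exactly one root of the cubic should fall inside that interval, and it should lie between the ML estimate $S/N$ and the prior mean $\mu$.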


Candidate 2: $f_{\Theta}(\theta) = \frac{\lambda}{2} e^{-\lambda|\theta|}$, a Laplace prior. In this case, the optimization problem becomes

$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}} \sum_{n=1}^{N} \left\{x_n\log\theta + (1-x_n)\log(1-\theta)\right\} - \lambda|\theta|.$$

Welcome to convex optimization! There is no closed-form solution. If you want to solve this problem, you need to call a convex solver.
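Candidate 3: $f_{\Theta}(\theta) = \frac{1}{C}\theta^{\alpha-1}(1-\theta)^{\beta-1}$, a beta prior, where $C$ is a normalization constant. This prior looks very complicated, but let's plug it into our optimization problem:

$$\begin{aligned}
\hat{\theta}_{\text{MAP}} &= \underset{\theta}{\text{argmax}} \sum_{n=1}^{N} \left\{x_n\log\theta + (1-x_n)\log(1-\theta)\right\} + (\alpha-1)\log\theta + (\beta-1)\log(1-\theta) \\
&= \underset{\theta}{\text{argmax}}\ (S+\alpha-1)\log\theta + (N-S+\beta-1)\log(1-\theta),
\end{aligned}$$

where $S = \sum_{n=1}^{N} x_n$. Taking the derivative and setting it to zero, we have

$$\frac{S+\alpha-1}{\theta} - \frac{N-S+\beta-1}{1-\theta} = 0.$$

Rearranging the terms we obtain the final estimate:

$$\hat{\theta}_{\text{MAP}} = \frac{S+\alpha-1}{N+\alpha+\beta-2}. \tag{8.49}$$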


There are a number of intuitions that we can draw from this beta prior, but most importantly, we have obtained a very simple solution. That is because the posterior distribution remains in the same form as the prior, after multiplying by the likelihood. Specifically, if we use the beta prior, the posterior distribution is

$$\begin{aligned}
f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x}) &\propto f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)\, f_{\Theta}(\theta) \\
&\propto \left[\prod_{n=1}^{N} \theta^{x_n}(1-\theta)^{1-x_n}\right] \theta^{\alpha-1}(1-\theta)^{\beta-1} \\
&= \theta^{S+\alpha-1}(1-\theta)^{N-S+\beta-1}.
\end{aligned}$$

This is still in the form of $\theta^{a-1}(1-\theta)^{b-1}$, which is the same as the prior. When this happens, we call the prior a conjugate prior. In this example, the beta prior is a conjugate prior for the Bernoulli likelihood.
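The beta-Bernoulli update is simple enough to carry out by hand, and a short sketch makes the bookkeeping concrete. The coin flips and the hyperparameters $\alpha = \beta = 2$ below are hypothetical. The code computes the posterior parameters $(\alpha + S,\ \beta + N - S)$, the MAP estimate from Equation (8.49), and a grid-based cross-check that adds the log-likelihood and the log-prior separately.

```python
import numpy as np

x = np.array([1, 1, 0, 1])     # hypothetical coin flips
alpha, beta = 2.0, 2.0         # hypothetical hyperparameters of the beta prior
N, S = len(x), int(x.sum())

theta_ml = S / N
theta_map = (S + alpha - 1) / (N + alpha + beta - 2)    # Equation (8.49)
post_alpha, post_beta = alpha + S, beta + N - S         # posterior is again a beta

# Cross-check: maximize log-likelihood + log-prior (up to constants) on a grid.
grid = np.linspace(1e-4, 1.0 - 1e-4, 100_001)
log_lik = S * np.log(grid) + (N - S) * np.log(1.0 - grid)
log_prior = (alpha - 1) * np.log(grid) + (beta - 1) * np.log(1.0 - grid)
theta_map_grid = grid[np.argmax(log_lik + log_prior)]

print(theta_ml, theta_map, theta_map_grid, (post_alpha, post_beta))
```

With only four flips, the prior pulls the estimate from the ML value of 3/4 towards 1/2, which is the behavior we expect when data are scarce.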


What is a conjugate prior? 


• It is a prior such that when multiplied by the likelihood to form the posterior, the posterior $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ takes the same form as the prior $f_{\Theta}(\theta)$.


e Every likelihood has its conjugate prior. 


e Conjugate priors are not necessarily good priors. They are just computationally 
friendly. Some of them have good physical interpretations. 


We can make a few interpretations of the beta prior, in the context of Bernoulli likelihood. First, the beta distribution takes the form

$$f_{\Theta}(\theta) = \frac{1}{B(\alpha,\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}, \tag{8.50}$$

where $B(\alpha,\beta)$ is the beta function⁵. The shape of the beta distribution is shown in Figure 8.20. For different choices of $\alpha$ and $\beta$, the distribution has a peak located towards either side of the interval $[0,1]$. For example, if $\alpha$ is large but $\beta$ is small, the distribution $f_{\Theta}(\theta)$ leans towards 1 (the yellow curve).

As a user, you have the freedom to pick $f_{\Theta}(\theta)$. Even if you are restricted to the beta distribution, you still have plenty of degrees of freedom in choosing $\alpha$ and $\beta$ so that your choice matches your belief. For example, if you know ahead of time that the Bernoulli experiment is biased towards 1 (e.g., the coin is more likely to come up heads), you can choose a large $\alpha$ and a small $\beta$. By contrast, if you believe that the coin is fair, you choose $\alpha = \beta$. The parameters $\alpha$ and $\beta$ are known as the hyperparameters of the prior distribution. Hyperparameters are parameters for $f_{\Theta}(\theta)$.

⁵ The beta function is defined as $B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$, where $\Gamma$ is the gamma function. For integer $n$, $\Gamma(n) = (n-1)!$

Figure 8.20: Beta distribution $f_{\Theta}(\theta)$ for various choices of $\alpha$ and $\beta$. When $(\alpha,\beta) = (2,8)$, the beta distribution favors small $\theta$. When $(\alpha,\beta) = (8,2)$, the beta distribution favors large $\theta$. By swinging between the $(\alpha,\beta)$ pairs, we obtain a prior that has a preference over $\theta$.
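
Example 8.20. (Prior for Gaussian mean) Consider a Gaussian likelihood for a fixed variance $\sigma^2$ and unknown mean $\theta$:

$$f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{N} \exp\left\{-\sum_{n=1}^{N}\frac{(x_n-\theta)^2}{2\sigma^2}\right\}.$$

Show that the conjugate prior is given by

$$f_{\Theta}(\theta) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right\}. \tag{8.51}$$

Solution. We have shown this result previously. By some (tedious) completing squares, we show that

$$f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x}) = \frac{1}{\sqrt{2\pi\hat{\sigma}_{\text{MAP}}^2}} \exp\left\{-\frac{(\theta-\hat{\theta}_{\text{MAP}})^2}{2\hat{\sigma}_{\text{MAP}}^2}\right\},$$

where $\hat{\theta}_{\text{MAP}}$ and $\hat{\sigma}_{\text{MAP}}^2$ are the posterior mean and variance given in Equation (8.46). Since $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$ is in the same form as $f_{\Theta}(\theta)$, we know that $f_{\Theta}(\theta)$ is a conjugate prior.

Example 8.21. (Prior for Gaussian variance) Consider a Gaussian likelihood for a fixed mean $\mu$ and unknown variance $\sigma^2$:

$$f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\sigma) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{N} \exp\left\{-\sum_{n=1}^{N}\frac{(x_n-\mu)^2}{2\sigma^2}\right\}.$$

Find the conjugate prior.

Solution. We first define the precision $\theta = \frac{1}{\sigma^2}$. The likelihood is

$$f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta) = \left(\frac{1}{\sqrt{2\pi}}\right)^{N} \theta^{N/2} \exp\left\{-\frac{\theta}{2}\sum_{n=1}^{N}(x_n-\mu)^2\right\}.$$

We propose to choose the prior $f_{\Theta}(\theta)$ as

$$f_{\Theta}(\theta) = \frac{1}{\Gamma(a)}\, b^{a}\, \theta^{a-1} \exp\{-b\theta\},$$

for some $a$ and $b$. This $f_{\Theta}(\theta)$ is called the Gamma distribution $\text{Gamma}(\theta|a,b)$. We can show that $\mathbb{E}[\theta] = \frac{a}{b}$ and $\text{Var}[\theta] = \frac{a}{b^2}$. With some (tedious) completing squares, we show that the posterior is

$$f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x}) \propto \theta^{(a+N/2)-1} \exp\left\{-\left(b + \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^2\right)\theta\right\},$$

which is in the same form as the prior. So we know that our proposed $f_{\Theta}(\theta)$ is a conjugate prior.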
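A quick numerical check of this conjugacy is sketched below. It assumes the shape-rate form of the Gamma density, $\text{Gamma}(\theta|a,b) \propto \theta^{a-1}e^{-b\theta}$, and the standard update $a \to a + N/2$, $b \to b + \frac{1}{2}\sum_n (x_n-\mu)^2$. The data and the hyperparameters are hypothetical. If the posterior really is a Gamma with the updated parameters, then log-likelihood plus log-prior minus the log of that Gamma density must be the same constant for every $\theta$.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
mu = 0.0                        # known mean
N = 20
x = rng.normal(mu, 2.0, size=N) # hypothetical data with unknown precision
a, b = 3.0, 2.0                 # hypothetical hyperparameters (shape a, rate b)

def log_gamma_pdf(theta, shape, rate):
    """Log-density of Gamma(shape, rate) in the shape-rate parameterization."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * np.log(theta) - rate * theta)

def log_likelihood(theta):
    """Gaussian log-likelihood written in terms of the precision theta = 1/sigma^2."""
    return (-0.5 * N * math.log(2.0 * math.pi) + 0.5 * N * np.log(theta)
            - 0.5 * theta * np.sum((x - mu) ** 2))

# Conjugate update of the hyperparameters.
a_post = a + N / 2.0
b_post = b + 0.5 * np.sum((x - mu) ** 2)

# The gap below should not depend on theta (it equals the log of the normalizer).
theta = np.linspace(0.05, 2.0, 50)
gap = log_likelihood(theta) + log_gamma_pdf(theta, a, b) - log_gamma_pdf(theta, a_post, b_post)

print(a_post, b_post, gap.min(), gap.max())
```

The minimum and maximum of `gap` should agree to within floating-point error.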


The story of conjugate priors is endless because every likelihood has its conjugate prior. 
Table 8.1 summarizes a few commonly used conjugate priors, their likelihoods, and their 
posteriors. The list can be expanded further to distributions with multiple parameters. For 
example, if a Gaussian has both unknown mean and variance, then there exists a conjugate 
prior consisting of a Gaussian multiplied by a Gamma. Conjugate priors also apply to multi- 
dimensional distributions. For example, the prior for the mean vector of a high-dimensional 
Gaussian is another high-dimensional Gaussian. The prior for the covariance matrix of a 
high-dimensional Gaussian is the Wishart prior. The prior for both the mean vector and the 
covariance matrix is the normal Wishart. 


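Table of Conjugate Priors

Likelihood                              Conjugate Prior                        Posterior
$f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)$    $f_{\Theta}(\theta)$                   $f_{\Theta|\boldsymbol{X}}(\theta|\boldsymbol{x})$

Bernoulli($\theta$)                     Beta($\alpha, \beta$)                  Beta($\alpha + S,\ \beta + N - S$)
Poisson($\theta$)                       Gamma($\alpha, \beta$)                 Gamma($\alpha + S$, [second parameter illegible in the source])
Exponential($\theta$)                   Gamma($\alpha, \beta$)                 Gamma($\alpha + N$, [second parameter illegible in the source])
Gaussian($\theta, \sigma^2$)            Gaussian($\mu_0, \sigma_0^2$)          Gaussian($\hat{\theta}_{\text{MAP}},\ \hat{\sigma}_{\text{MAP}}^2$), see Equation (8.46)
Gaussian($\mu, \sigma^2$)               Inv. Gamma($\alpha, \beta$)            Inv. Gamma($\alpha + \frac{N}{2},\ \beta + \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^2$)

Table 8.1: Commonly used conjugate priors.


8.3.7 Linking MAP with regression


ML and regression represent the statistics and the optimization aspects of the same problem. With the parallel argument, MAP is linked to the regularized regression. The reason follows immediately from the definition of MAP:

$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}}\ \underbrace{\log f_{\boldsymbol{X}|\Theta}(\boldsymbol{x}|\theta)}_{\text{data fidelity}} + \underbrace{\log f_{\Theta}(\theta)}_{\text{regularization}}.$$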


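To make this more explicit, we consider the following linear regression problem:

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{=\boldsymbol{y}} = \underbrace{\begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{d-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{d-1}(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{d-1}(x_N) \end{bmatrix}}_{=\boldsymbol{X}} \underbrace{\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{d-1} \end{bmatrix}}_{=\boldsymbol{\theta}} + \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}}_{=\boldsymbol{e}}.$$

If we assume that $\boldsymbol{e} \sim \text{Gaussian}(0, \sigma^2 \boldsymbol{I})$, the likelihood is defined as

$$f_{\boldsymbol{Y}|\Theta}(\boldsymbol{y}|\boldsymbol{\theta}) = \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left\{-\frac{1}{2\sigma^2}\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2\right\}. \tag{8.52}$$

In the ML setting, the ML estimate is the maximizer of the likelihood:

$$\begin{aligned}
\hat{\boldsymbol{\theta}}_{\text{ML}} &= \underset{\boldsymbol{\theta}}{\text{argmax}}\ \log f_{\boldsymbol{Y}|\Theta}(\boldsymbol{y}|\boldsymbol{\theta}) \\
&= \underset{\boldsymbol{\theta}}{\text{argmin}}\ \frac{1}{2\sigma^2}\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2.
\end{aligned}$$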

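For MAP, we add a prior term so that the optimization becomes

$$\begin{aligned}
\hat{\boldsymbol{\theta}}_{\text{MAP}} &= \underset{\boldsymbol{\theta}}{\text{argmax}}\ \log f_{\boldsymbol{Y}|\Theta}(\boldsymbol{y}|\boldsymbol{\theta}) + \log f_{\Theta}(\boldsymbol{\theta}) \\
&= \underset{\boldsymbol{\theta}}{\text{argmin}}\ \frac{1}{2\sigma^2}\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 - \log f_{\Theta}(\boldsymbol{\theta}).
\end{aligned}$$

Therefore, the regularization of the regression is exactly $-\log f_{\Theta}(\boldsymbol{\theta})$. We can perform reverse engineering to find out the corresponding prior for our favorite choices of the regularization.

Ridge regression. Suppose that

$$f_{\Theta}(\boldsymbol{\theta}) = \exp\left\{-\frac{\|\boldsymbol{\theta}\|^2}{2\sigma_0^2}\right\}.$$

Taking the negative log on both sides yields

$$-\log f_{\Theta}(\boldsymbol{\theta}) = \frac{\|\boldsymbol{\theta}\|^2}{2\sigma_0^2}.$$

Putting this into the MAP estimate,

$$\begin{aligned}
\hat{\boldsymbol{\theta}}_{\text{MAP}} &= \underset{\boldsymbol{\theta}}{\text{argmin}}\ \frac{1}{2\sigma^2}\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 + \frac{1}{2\sigma_0^2}\|\boldsymbol{\theta}\|^2 \\
&= \underset{\boldsymbol{\theta}}{\text{argmin}}\ \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 + \underbrace{\frac{\sigma^2}{\sigma_0^2}}_{=\lambda}\|\boldsymbol{\theta}\|^2,
\end{aligned}$$

where $\lambda$ is the corresponding ridge regularization parameter. Therefore, the ridge regression is equivalent to a MAP estimation using a Gaussian prior.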


How is MAP related to ridge regression? 


• In MAP, define the prior as a Gaussian: 

\[
f_{\Theta}(\theta) = \exp\left\{ -\frac{\|\theta\|^2}{2\sigma_0^2} \right\}. \tag{8.53}
\]

• The prior says that the solution $\theta$ is naturally distributed according to a Gaussian 
with mean zero and variance $\sigma_0^2$. 
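
The link between ridge regression and the Gaussian prior can be checked numerically. The following is a minimal sketch, not a definitive implementation: the problem sizes, the values of `sigma` and `sigma0`, and the synthetic data are all assumptions made for illustration. It computes the ridge solution with $\lambda = \sigma^2/\sigma_0^2$ and confirms that the gradient of the negative log-posterior vanishes there, so the same point is the MAP estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes and noise/prior scales (assumptions for illustration).
N, d = 50, 5
sigma, sigma0 = 0.5, 2.0

# Synthetic linear model y = X theta + e with Gaussian noise.
X = rng.normal(size=(N, d))
theta_true = rng.normal(scale=sigma0, size=d)
y = X @ theta_true + rng.normal(scale=sigma, size=N)

# Ridge solution with lambda = sigma^2 / sigma0^2.
lam = sigma**2 / sigma0**2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def neg_log_posterior_grad(theta):
    # Gradient of ||y - X theta||^2 / (2 sigma^2) + ||theta||^2 / (2 sigma0^2).
    return -(X.T @ (y - X @ theta)) / sigma**2 + theta / sigma0**2

# A (numerically) zero gradient means the ridge solution maximizes the posterior.
grad_norm = np.linalg.norm(neg_log_posterior_grad(theta_ridge))
print(grad_norm)
```

Because the negative log-posterior is a strictly convex quadratic in $\theta$, a vanishing gradient is enough to identify the unique minimizer.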


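LASSO regression. Suppose that 

\[
f_{\Theta}(\theta) = \exp\left\{ -\frac{\|\theta\|_1}{\alpha} \right\}.
\]

Taking the negative log on both sides yields 

\[
-\log f_{\Theta}(\theta) = \frac{\|\theta\|_1}{\alpha}.
\]

Putting this into the MAP estimate we can show that 

\[
\begin{aligned}
\widehat{\theta}_{\text{MAP}} &= \operatorname*{argmin}_{\theta} \; \frac{1}{2\sigma^2}\|y - X\theta\|^2 + \frac{1}{\alpha}\|\theta\|_1 \\
&= \operatorname*{argmin}_{\theta} \; \frac{1}{2}\|y - X\theta\|^2 + \underbrace{\frac{\sigma^2}{\alpha}}_{=\lambda} \|\theta\|_1.
\end{aligned}
\]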
To summarize: 


How is MAP related to LASSO regression? 
• LASSO is a MAP using the prior 

\[
f_{\Theta}(\theta) = \exp\left\{ -\frac{\|\theta\|_1}{\alpha} \right\}.
\]
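
As a small sanity check of the LASSO correspondence, the sketch below compares the two objectives on a one-dimensional grid. It assumes a hypothetical scalar problem (a single observation $y = \theta + e$) with made-up values for `y`, `sigma`, and `alpha`; none of these come from the text. Since the LASSO objective is the negative log-posterior multiplied by $\sigma^2$, both should attain their minimum at the same grid point.

```python
import numpy as np

# Hypothetical scalar problem: one observation y = theta + e (assumed values).
y, sigma, alpha = 1.3, 0.8, 0.5
lam = sigma**2 / alpha  # LASSO parameter implied by the prior scale alpha

theta = np.linspace(-3.0, 3.0, 60001)

# Negative log-posterior (up to a constant) with prior exp(-|theta| / alpha).
neg_log_post = (y - theta) ** 2 / (2 * sigma**2) + np.abs(theta) / alpha
# LASSO objective with lambda = sigma^2 / alpha.
lasso_obj = 0.5 * (y - theta) ** 2 + lam * np.abs(theta)

theta_map = theta[np.argmin(neg_log_post)]
theta_lasso = theta[np.argmin(lasso_obj)]
print(theta_map, theta_lasso)
```

The grid search is only meant to illustrate the equivalence; in practice one would use a dedicated LASSO solver.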


At this point, you may be wondering what MAP buys us when regularized regression 
can already do the job. The answer is about the interpretation. While regularized regression 
can always return us a result, that is just a result. However, if you know that the parameter $\theta$ 
is distributed according to some distribution $f_{\Theta}(\theta)$, MAP offers a statistical perspective of 
the solution in the sense that it returns the peak of the posterior $f_{\Theta|X}(\theta|x)$. For example, if 
we know that the data is generated from a linear model with Gaussian noise, and if we know 
that the true regression coefficients are drawn from a Gaussian, then the ridge regression is 
guaranteed to be optimal in the posterior sense. Similarly, if we know that there are outliers 
and have some ideas about the outlier statistics, perhaps the LASSO regression is a better 
choice. 

It is also important to note the different optimalities offered by MAP versus ML versus 
regression. The optimality offered by regression is the training loss, which can always give 
us a result even if the underlying statistics do not match the optimization formulation, 
e.g., there are outliers, and you use unregularized least-squares minimization. You can get a 
result, but the outliers will heavily influence your solution. On the other hand, if you know 
the data statistics and choose to follow the ML, then the ML solution is optimal in the sense 
of optimizing the likelihood $f_{X|\Theta}(x|\theta)$. If you further know the prior statistics, the MAP 
solution will be optimal, but this time it is optimal w.r.t. the posterior $f_{\Theta|X}(\theta|x)$. Since 
each of these is optimizing for a different goal, they are only good for their chosen objectives. 
For example, $\widehat{\theta}_{\text{MAP}}$ can be a biased estimate if our goal is to maximize the likelihood. The 
$\widehat{\theta}_{\text{ML}}$ is optimal for the likelihood but can be a bad choice for the posterior. Both $\widehat{\theta}_{\text{MAP}}$ and 
$\widehat{\theta}_{\text{ML}}$ can possibly achieve a reasonable mean-squared error, but their results may not make 
sense (e.g., if $\theta$ is an image then $\widehat{\theta}_{\text{MAP}}$ may over-smooth the image whereas $\widehat{\theta}_{\text{ML}}$ amplifies 
noise). So it is incorrect to think that $\widehat{\theta}_{\text{MAP}}$ is superior to $\widehat{\theta}_{\text{ML}}$ because it is more general. 

Here are some rules of thumb for MAP, ML, and regression: 


When should I use regression, ML and MAP? 


• Regression: If you are lazy and you know nothing about the statistics, do the 
regression with whatever regularization you prefer. It will give you a result. See 
if it makes sense with your data. 


• MAP: If you know the statistics of the data, and if you have some preference for 
the prior distribution, go with MAP. It will offer you the optimal solution w.r.t. 
finding the peak of the posterior. 


• ML: If you are interested in some simple-form solution, and you want those nice 
properties such as consistency and unbiasedness, then go with ML. It usually 
possesses the “friendly” properties so that you can derive the performance limit. 

8.4 Minimum Mean-Square Estimation 


First-time readers are often tempted to think that the maximum-likelihood estimation or 
the maximum a posteriori estimation are the best methods to estimate parameters. In some 
sense, this is true because both estimation procedures offer some form of optimal explanation 
for the observed variables. However, as we said above, being optimal with respect to the 
likelihood or the posterior only means optimal under the respective criteria. An ML estimate 
is not necessarily optimal for the posterior, whereas a MAP estimate is not necessarily 
optimal for the likelihood. Therefore, as we proceed to the third commonly used estimation 
strategy, we need to remind ourselves of the specific type of optimality we seek. 


8.4.1 Positioning the minimum mean-square estimation 


Mean-square error estimation, as it is termed, uses the mean-square error as the optimality 
criterion. The corresponding estimation process is known as the minimum mean-square 
estimation (MMSE). MMSE is a Bayesian approach, meaning that it uses the prior $f_{\Theta}(\theta)$ 
as well as the likelihood $f_{X|\Theta}(x|\theta)$. As we will show shortly, the MMSE estimate of a set 
of i.i.d. observations $X = [X_1, \ldots, X_N]^T$ is 

\[
\begin{aligned}
\widehat{\theta}_{\text{MMSE}}(x) &\overset{(a)}{=} \mathbb{E}_{\Theta|X}[\Theta \,|\, X = x] \qquad \text{(a): We will discuss this.} \\
&= \int \theta \, f_{\Theta|X}(\theta|x) \, d\theta.
\end{aligned}
\tag{8.55}
\]


You may find this equation very surprising, because it says that the MMSE estimate is 
the mean of the posterior distribution $f_{\Theta|X}(\theta|x)$. Let’s compare this result with the ML 
estimate and the MAP estimate: 

\[
\begin{aligned}
\widehat{\theta}_{\text{ML}} &= \text{peak of } f_{X|\Theta}(x \,|\, \theta), \\
\widehat{\theta}_{\text{MAP}} &= \text{peak of } f_{\Theta|X}(\theta \,|\, x), \\
\widehat{\theta}_{\text{MMSE}} &= \text{average of } f_{\Theta|X}(\theta \,|\, x).
\end{aligned}
\]

Therefore, an MMSE estimate is not by any means universally superior or inferior to a MAP 
estimate or an ML estimate. It is just a different estimate with a different goal. 

So how exactly are these estimates different? Figure 8.21 illustrates a typical situation 
of an asymmetric distribution. Here, we plot both the likelihood function $f_{X|\Theta}(x \,|\, \theta)$ and the 
posterior function $f_{\Theta|X}(\theta \,|\, x)$. 

As shown in the figure, the ML estimate is the peak of the likelihood, whereas the MAP 
estimate is the peak of the posterior. The third estimate is the MMSE estimate, which is 
the average of the posterior distribution. It is easy to see that if the posterior distribution 
is symmetric and has a single peak, the peak is always the mean. Therefore, for single-peak 
symmetric distributions, MMSE and MAP estimates are identical. 

Figure 8.21: A typical example of an ML estimate, a MAP estimate and an MMSE estimate. 


What is so special about the MMSE estimate? 
e MMSE is a Bayesian estimation, so it requires a prior. 


e An MMSE estimate is the mean of the posterior distribution. 


e MMSE estimate = MAP estimate if the posterior distribution is symmetric and 
has a single peak. 
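
To see the three estimates side by side, here is a small numerical sketch. It assumes a hypothetical Bernoulli experiment with `S` ones out of `N` trials and a Beta(`a`, `b`) prior (the conjugate pair listed in Table 8.1); the specific numbers are made up for illustration. The likelihood and the posterior are evaluated on a grid, and the peak and the mean are read off directly.

```python
import numpy as np

# Hypothetical data: S ones out of N Bernoulli trials, with a Beta(a, b) prior.
N, S = 10, 2
a, b = 2.0, 2.0

theta = np.linspace(1e-6, 1 - 1e-6, 200001)
dtheta = theta[1] - theta[0]

likelihood = theta**S * (1 - theta) ** (N - S)
prior = theta ** (a - 1) * (1 - theta) ** (b - 1)
posterior = likelihood * prior
posterior /= posterior.sum() * dtheta            # normalize numerically

theta_ml = theta[np.argmax(likelihood)]          # peak of the likelihood
theta_map = theta[np.argmax(posterior)]          # peak of the posterior
theta_mmse = (theta * posterior).sum() * dtheta  # mean of the posterior
print(theta_ml, theta_map, theta_mmse)
```

Because the posterior here is asymmetric, the three numbers differ; with a symmetric single-peak posterior the last two would coincide.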


8.4.2 Mean squared error 


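The MMSE is based on minimizing the mean squared error (MSE). In this subsection we 
discuss the mean squared error in the Bayesian setting. In the deterministic setting, given 
an estimate $\widehat{\theta}$ and a ground truth $\theta$, the MSE is defined as 

\[
\text{MSE}(\underbrace{\theta}_{\text{ground truth}}, \underbrace{\widehat{\theta}}_{\text{estimate}}) = (\theta - \widehat{\theta})^2. \tag{8.56}
\]

In any estimation problem, the estimate $\widehat{\theta}$ is always a function of the observed variables. 
Thus, we have 

\[
\widehat{\theta}(X) = g(X), \qquad \text{where } X = [X_1, \ldots, X_N]^T,
\]

for some function $g(\cdot)$. Substituting this into the definition of MSE, and recognizing that $X$ 
is drawn from a distribution $f_X(x)$, we take the expectation to define the MSE as 

\[
\begin{aligned}
\text{MSE}(\theta, \widehat{\theta}) &= (\theta - \widehat{\theta})^2 \\
&\Downarrow \;\; \text{replace } \widehat{\theta} \text{ by } g(X) \\
\text{MSE}(\theta, \widehat{\theta}) &= (\theta - g(X))^2 \\
&\Downarrow \;\; \text{take expectation over } X \\
\text{MSE}(\theta, \widehat{\theta}) &= \mathbb{E}_X\left[(\theta - g(X))^2\right].
\end{aligned}
\]

Thus we have arrived at the definition of MSE. We call this the frequentist version, because 
the parameter $\theta$ is deterministic. 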

Definition 8.8 (Mean squared error, frequentist). The mean squared error of an 
estimate $g(X)$ w.r.t. the true parameter $\theta$ is 

\[
\text{MSE}_{\text{freq}}(\theta, g(\cdot)) = \mathbb{E}_X\left[(\theta - g(X))^2\right]. \tag{8.57}
\]

If the parameter $\theta$ is high-dimensional, so is the estimate $g(X)$, and the MSE is 

\[
\text{MSE}_{\text{freq}}(\theta, g(\cdot)) = \mathbb{E}_X\left[\|\theta - g(X)\|^2\right]. \tag{8.58}
\]

Note that in the above definition the MSE is measured between the true parameter $\theta$ and 
the estimator $g(\cdot)$. We use the function $g(\cdot)$ here because we have taken the expectation 
over all the possible inputs $X$. So we are not comparing $\theta$ with a value $g(X)$ but with the 
function $g(\cdot)$. 

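If we take a Bayesian approach such as the MAP, then $\theta$ itself is a random variable $\Theta$. 
To compute the MSE, we then need to take the average across all the possible choices of 
ground truth $\Theta$. This leads to 

\[
\begin{aligned}
\text{MSE}(\theta, \widehat{\theta}) &= \mathbb{E}_X\left[(\theta - g(X))^2\right] \\
&\Downarrow \;\; \text{replace } \theta \text{ by } \Theta \\
\text{MSE}(\Theta, \widehat{\theta}) &= \mathbb{E}_X\left[(\Theta - g(X))^2\right] \\
&\Downarrow \;\; \text{take expectation over } \Theta \\
\text{MSE}(\Theta, \widehat{\theta}) &= \mathbb{E}_{X,\Theta}\left[(\Theta - g(X))^2\right].
\end{aligned}
\]

Therefore, we have arrived at our definition of the MSE in the Bayesian setting. 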


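Definition 8.9 (Mean squared error, Bayesian). The mean squared error of an 
estimate $g(X)$ w.r.t. the true parameter $\Theta$ is 

\[
\text{MSE}_{\text{Bayes}}(\Theta, g(\cdot)) = \mathbb{E}_{\Theta,X}\left[(\Theta - g(X))^2\right]. \tag{8.59}
\]

If the parameter $\Theta$ is high-dimensional, so is the estimate $g(X)$, and the MSE is 

\[
\text{MSE}_{\text{Bayes}}(\Theta, g(\cdot)) = \mathbb{E}_{\Theta,X}\left[\|\Theta - g(X)\|^2\right]. \tag{8.60}
\]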


The difference between the Bayesian MSE and the frequentist MSE is the expectation over $\Theta$. 
Practically speaking, the frequentist MSE is more of an evaluation metric than an objective 
function for solving an inverse problem. The reason is that in an inverse problem, we never 
have access to the true parameter $\theta$. (If we knew $\theta$, there would be no problem to solve.) 
Bayesian MSE is more meaningful. It says that we do not know the true parameter $\theta$, but 
we know its statistics. We are trying to find the best $g(\cdot)$ that minimizes the error. Our 
solution will depend on the statistics of $\Theta$ but not on the unknown true parameter $\theta$. 

When we say minimum mean squared error estimation, we typically refer to the 
Bayesian MMSE. In this case, the problem we solve is 

\[
\widehat{g}(\cdot) = \operatorname*{argmin}_{g(\cdot)} \; \mathbb{E}_{\Theta,X}\left[(\Theta - g(X))^2\right]. \tag{8.61}
\]

As you can see from Definition 8.9, the goal of the Bayesian MMSE is to find a function 
$g: \mathbb{R}^N \to \mathbb{R}$ such that the joint expectation $\mathbb{E}_{\Theta,X}[(\Theta - g(X))^2]$ is minimized. In the case 
where $\Theta$ is a vector, the problem becomes 

\[
\widehat{g}(\cdot) = \operatorname*{argmin}_{g(\cdot)} \; \mathbb{E}_{\Theta,X}\left[\|\Theta - g(X)\|^2\right], \tag{8.62}
\]

where $g(\cdot): \mathbb{R}^{N \times d} \to \mathbb{R}^d$ if $\Theta$ is a $d$-dimensional vector. The function $g$ will take a sequence 
of $N$ observed numbers and estimate the parameter $\Theta$. 


What is the Bayesian MMSE estimate? 
The Bayesian MMSE estimate is obtained by minimizing the MSE: 

\[
\widehat{g}(\cdot) = \operatorname*{argmin}_{g(\cdot)} \; \mathbb{E}_{\Theta,X}\left[(\Theta - g(X))^2\right].
\]


8.4.3 MMSE estimate = conditional expectation 


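Theorem 8.2. The Bayesian MMSE estimate is 

\[
\begin{aligned}
\widehat{\theta}_{\text{MMSE}}(x) &= \operatorname*{argmin}_{g(\cdot)} \; \mathbb{E}_{\Theta,X}\left[(\Theta - g(X))^2\right] \\
&= \mathbb{E}_{\Theta|X}\left[\Theta \,|\, X = x\right].
\end{aligned}
\]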


Proof. First of all, we decompose the joint expectation: 

\[
\mathbb{E}_{\Theta,X}\left[(\Theta - g(X))^2\right] = \int \mathbb{E}_{\Theta|X}\left[(\Theta - g(X))^2 \,|\, X = x\right] f_X(x) \, dx.
\]

Since $f_X(x) \ge 0$ for all $x$, and $\mathbb{E}_{\Theta|X}[(\Theta - g(X))^2 \,|\, X = x] \ge 0$ because it is a square, it 
follows that the integral is minimized when $\mathbb{E}_{\Theta|X}[(\Theta - g(X))^2 \,|\, X = x]$ is minimized for every $x$. 
The conditional expectation can be evaluated as 

\[
\begin{aligned}
\mathbb{E}_{\Theta|X}\left[(\Theta - g(X))^2 \,|\, X = x\right]
&= \mathbb{E}_{\Theta|X}\left[\Theta^2 - 2\Theta g(X) + g(X)^2 \,|\, X = x\right] \\
&= \underbrace{\mathbb{E}_{\Theta|X}\left[\Theta^2 \,|\, X = x\right]}_{\overset{\text{def}}{=} V(x)} - 2\underbrace{\mathbb{E}_{\Theta|X}\left[\Theta \,|\, X = x\right]}_{\overset{\text{def}}{=} u(x)} g(x) + g(x)^2 \\
&= V(x) - 2u(x)g(x) + g(x)^2 + u(x)^2 - u(x)^2 \\
&= V(x) - u(x)^2 + (u(x) - g(x))^2 \\
&\ge V(x) - u(x)^2, \qquad \forall g(\cdot),
\end{aligned}
\]

where the last inequality holds because no matter what $g(\cdot)$ we choose, the square term 
$(u(x) - g(x))^2$ is non-negative. Therefore, $\mathbb{E}_{\Theta|X}[(\Theta - g(X))^2 \,|\, X = x]$ is lower-bounded by 
$V(x) - u(x)^2$, which is a bound that is independent of $g(\cdot)$. If we can find a $g(\cdot)$ such that 
this lower bound can be met, the corresponding $g(\cdot)$ is the minimizer. 

To this end we only need to make $\mathbb{E}_{\Theta|X}[(\Theta - g(X))^2 \,|\, X = x]$ equal $V(x) - u(x)^2$, 
but this is easy: the equality holds if and only if $(u(x) - g(x))^2 = 0$. In other words, if we 
choose $g(\cdot)$ such that $g(x) = u(x)$, the corresponding $g(\cdot)$ is the minimizer. This $g(\cdot)$, by 
substituting the definition of $u(x)$, is 

\[
g(x) = \mathbb{E}_{\Theta|X}\left[\Theta \,|\, X = x\right]. \tag{8.65}
\]


This completes the proof. 
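
The theorem can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: it uses a toy model that is not from the text ($\Theta \sim \text{Gaussian}(0,1)$ and $X = \Theta + W$ with $W \sim \text{Gaussian}(0,1)$, for which the posterior mean is $X/2$), and it searches only over linear estimators $g(X) = cX$ rather than over all functions. The empirical Bayesian MSE should be smallest near $c = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model: Theta ~ N(0, 1), X = Theta + W with W ~ N(0, 1).
# For this model the posterior mean is E[Theta | X] = X / 2.
n = 200_000
theta = rng.normal(size=n)
x = theta + rng.normal(size=n)

# Evaluate the empirical Bayesian MSE of the linear estimators g(X) = c * X.
cs = np.linspace(0.0, 1.0, 101)
mse = np.array([np.mean((theta - c * x) ** 2) for c in cs])

best_c = cs[np.argmin(mse)]
print(best_c, mse.min())
```

For this model the smallest achievable MSE equals the posterior variance, which is 0.5, so the printed minimum should be close to that value.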


What is the MMSE estimate? 


The MMSE estimate is 

\[
\widehat{\theta}_{\text{MMSE}}(x) = \mathbb{E}_{\Theta|X}\left[\Theta \,|\, X = x\right].
\]

We emphasize that $\widehat{\theta}_{\text{MMSE}}(x)$ is a function of $x$, because for a different set of observations 
$x$ we will have a different estimated value. Since $x$ is a random realization of the random 
vector $X$, we can also define the MMSE estimator as 

\[
\widehat{\Theta}_{\text{MMSE}}(X) = \mathbb{E}_{\Theta|X}\left[\Theta \,|\, X\right]. \tag{8.67}
\]

In this notation, we emphasize that the estimator $\widehat{\Theta}_{\text{MMSE}}$ returns a random parameter. The 
input to the estimator is the random vector $X$. Because we are not looking at a particular 
realization $X = x$ but the general $X$, $\widehat{\Theta}_{\text{MMSE}}$ is a function of $X$ and not $x$. 


Conditional expectation of what? 


An MMSE estimator is the conditional expectation of $\Theta$ given $X = x$: 

\[
\mathbb{E}_{\Theta|X}\left[\Theta \,|\, X = x\right] = \int \theta \, f_{\Theta|X}(\theta|x) \, d\theta.
\]

This is the expectation using the posterior distribution $f_{\Theta|X}(\theta|x)$. It should be compared 
to the peak of the posterior, which returns us the MAP estimate. The posterior distribution 
is constructed through Bayes’ theorem: 

\[
f_{\Theta|X}(\theta|x) = \frac{f_{X|\Theta}(x|\theta) \, f_{\Theta}(\theta)}{f_X(x)}. \tag{8.69}
\]

Therefore, to evaluate the expectation of the conditional distribution, we need to include the 
normalization constant $f_X(x)$, which was omitted in MAP. 

The discussion about the mean squared error and the vector estimates can be skipped if 
this is your first time reading the book. 


What is the mean squared error when using the MMSE estimator? 


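• The mean squared error conditioned on the observation is 

\[
\begin{aligned}
\text{MSE}(\Theta, \widehat{\theta}_{\text{MMSE}}(X)) &\overset{\text{def}}{=} \mathbb{E}_{\Theta|X}\left[(\Theta - \widehat{\theta}_{\text{MMSE}}(X))^2 \,|\, X\right] \\
&= \text{Var}_{\Theta|X}\left[\Theta \,|\, X\right],
\end{aligned}
\]

which is the conditional variance. 


• The overall mean squared error, unconditioned, is 

\[
\begin{aligned}
\text{MSE}(\Theta, \widehat{\theta}_{\text{MMSE}}(\cdot)) &= \mathbb{E}_X\left[\text{Var}_{\Theta|X}[\Theta \,|\, X]\right] \\
&= \text{Var}_{\Theta}[\Theta] - \text{Var}_X\left[\mathbb{E}_{\Theta|X}[\Theta \,|\, X]\right],
\end{aligned}
\]

which, by the law of total variance, never exceeds the prior variance $\text{Var}_{\Theta}[\Theta]$. 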


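Proof. Let us prove these two statements. The resulting MSE is obtained by substituting 
$\widehat{\theta}_{\text{MMSE}}(x) = \mathbb{E}_{\Theta|X}[\Theta \,|\, x]$ into $\text{MSE}(\Theta, \widehat{\theta}_{\text{MMSE}}(X))$. To this end, we have that 

\[
\mathbb{E}_{\Theta|X}\left[(\Theta - \widehat{\theta}_{\text{MMSE}}(X))^2 \,|\, X\right] = V(X) - u(X)^2 + \underbrace{(u(X) - \widehat{\theta}_{\text{MMSE}}(X))^2}_{=0, \text{ because } \widehat{\theta}_{\text{MMSE}}(X) = \mathbb{E}_{\Theta|X}[\Theta | X] = u(X)}.
\]

The variables $V$ and $u$ are defined as 

\[
\begin{aligned}
V(X) &= \mathbb{E}_{\Theta|X}\left[\Theta^2 \,|\, X\right] = \text{2nd moment of } \Theta \text{ using } f_{\Theta|X}(\theta|x), \\
u(X) &= \mathbb{E}_{\Theta|X}\left[\Theta \,|\, X\right] = \text{1st moment of } \Theta \text{ using } f_{\Theta|X}(\theta|x).
\end{aligned}
\]

Since $\text{Var}[Z] = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2$ for any random variable $Z$, it follows that 

\[
\begin{aligned}
\mathbb{E}_{\Theta|X}\left[(\Theta - \widehat{\theta}_{\text{MMSE}}(X))^2 \,|\, X\right] &= V(X) - u(X)^2 \\
&= \mathbb{E}_{\Theta|X}\left[\Theta^2 \,|\, X\right] - \left(\mathbb{E}_{\Theta|X}[\Theta \,|\, X]\right)^2 \\
&= \text{variance of } \Theta \text{ using } f_{\Theta|X}(\theta|x) \\
&\overset{\text{def}}{=} \text{Var}_{\Theta|X}\left[\Theta \,|\, X\right].
\end{aligned}
\]

Substituting this conditional variance into the MSE definition, 

\[
\begin{aligned}
\text{MSE}(\Theta, \widehat{\theta}_{\text{MMSE}}(\cdot)) &= \int \mathbb{E}_{\Theta|X}\left[(\Theta - \widehat{\theta}_{\text{MMSE}}(X))^2 \,|\, X = x\right] f_X(x) \, dx \\
&= \int \text{Var}_{\Theta|X}\left[\Theta \,|\, X = x\right] f_X(x) \, dx \\
&= \mathbb{E}_X\left[\text{Var}_{\Theta|X}[\Theta \,|\, X]\right].
\end{aligned}
\]

By the law of total variance, $\text{Var}_{\Theta}[\Theta] = \mathbb{E}_X[\text{Var}_{\Theta|X}[\Theta \,|\, X]] + \text{Var}_X[\mathbb{E}_{\Theta|X}[\Theta \,|\, X]]$, so the 
overall MSE equals $\text{Var}_{\Theta}[\Theta] - \text{Var}_X[\mathbb{E}_{\Theta|X}[\Theta \,|\, X]]$ and is at most the prior variance. 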




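What happens if the parameter is a vector? 
• The MMSE estimate is $\widehat{\theta}_{\text{MMSE}}(x) = \mathbb{E}_{\Theta|X}[\Theta \,|\, X = x]$. 
• The MSE is 

\[
\text{MSE}(\Theta, \widehat{\theta}_{\text{MMSE}}(\cdot)) = \text{Tr}\left\{ \mathbb{E}_X\left\{ \text{Cov}(\Theta \,|\, X) \right\} \right\}.
\]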


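Proof. The first statement, that the MMSE estimate is 

\[
\widehat{\theta}_{\text{MMSE}}(x) = \mathbb{E}_{\Theta|X}\left[\Theta \,|\, X = x\right],
\]

is easy to understand since it just follows from the scalar case. The estimator is $\widehat{\Theta}_{\text{MMSE}}(X) = 
\mathbb{E}_{\Theta|X}[\Theta \,|\, X]$. The corresponding MSE is 

\[
\begin{aligned}
\text{MSE}(\Theta, \widehat{\theta}_{\text{MMSE}}(\cdot)) &= \mathbb{E}_{\Theta,X}\left[\|\Theta - \widehat{\theta}_{\text{MMSE}}(X)\|^2\right] \\
&= \mathbb{E}_X\left\{ \mathbb{E}_{\Theta|X}\left\{ \|\Theta - \widehat{\theta}_{\text{MMSE}}(X)\|^2 \,|\, X \right\} \right\},
\end{aligned}
\]

where we have used the law of total expectation to decompose the joint expectation. Using 
the matrix identities of Lemma 8.1 below, we have that 

\[
\begin{aligned}
\mathbb{E}_X\left\{ \mathbb{E}_{\Theta|X}\left\{ \|\Theta - \widehat{\theta}_{\text{MMSE}}(X)\|^2 \,|\, X \right\} \right\}
&= \mathbb{E}_X\left\{ \mathbb{E}_{\Theta|X}\left[ \text{Tr}\left\{ (\Theta - \widehat{\theta}_{\text{MMSE}}(X))(\Theta - \widehat{\theta}_{\text{MMSE}}(X))^T \right\} \,|\, X \right] \right\} \\
&= \text{Tr}\left\{ \mathbb{E}_X\left\{ \mathbb{E}_{\Theta|X}\left[ (\Theta - \widehat{\theta}_{\text{MMSE}}(X))(\Theta - \widehat{\theta}_{\text{MMSE}}(X))^T \,|\, X \right] \right\} \right\}.
\end{aligned}
\]

However, since the MMSE estimator is the conditional expectation of the posterior, it follows 
that the inner expectation is the conditional covariance. Therefore, we arrive at the second 
statement: 

\[
\begin{aligned}
\text{MSE}(\Theta, \widehat{\theta}_{\text{MMSE}}(\cdot)) &= \text{Tr}\left\{ \mathbb{E}_X\left\{ \mathbb{E}_{\Theta|X}\left[ (\Theta - \widehat{\theta}_{\text{MMSE}}(X))(\Theta - \widehat{\theta}_{\text{MMSE}}(X))^T \,|\, X \right] \right\} \right\} \\
&= \text{Tr}\left\{ \mathbb{E}_X\left\{ \text{Cov}(\Theta \,|\, X) \right\} \right\}.
\end{aligned}
\]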


To prove the two statements above, we need some tools from linear algebra. The two 
specific matrix identities are given by the following lemma: 


Lemma 8.1. The following are matrix identities: 


• For any random vector $\Theta \in \mathbb{R}^d$, 

\[
\|\Theta\|^2 = \text{Tr}(\Theta^T \Theta) = \text{Tr}(\Theta \Theta^T).
\]

• For any random vector $\Theta \in \mathbb{R}^d$, 

\[
\mathbb{E}_{\Theta}\left[\text{Tr}(\Theta \Theta^T)\right] = \text{Tr}\left(\mathbb{E}_{\Theta}\left[\Theta \Theta^T\right]\right).
\]

The proof of these two results is straightforward. The first is due to the cyclic property of 
the trace operator. The second statement is true because the trace is a linear operator that 
sums the diagonal of a matrix. 


The end of the discussion. Please join us again. 


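Example 8.22. Let 

\[
f_{X|\Theta}(x|\theta) = \begin{cases} \theta e^{-\theta x}, & x \ge 0, \\ 0, & x < 0, \end{cases}
\qquad
f_{\Theta}(\theta) = \begin{cases} \alpha e^{-\alpha \theta}, & \theta \ge 0, \\ 0, & \theta < 0. \end{cases}
\]

Find the ML, MAP, and MMSE estimates for a single observation $X = x$. 
Solution. We first find the posterior distribution: 

\[
\begin{aligned}
f_{\Theta|X}(\theta|x) &= \frac{f_{X|\Theta}(x|\theta) f_{\Theta}(\theta)}{f_X(x)} \\
&= \frac{\alpha \theta e^{-(\alpha + x)\theta}}{\int_0^\infty \alpha \theta e^{-(\alpha + x)\theta} \, d\theta} \\
&= \frac{\alpha \theta e^{-(\alpha + x)\theta}}{\alpha / (\alpha + x)^2} \\
&= (\alpha + x)^2 \theta e^{-(\alpha + x)\theta}.
\end{aligned}
\]

The MMSE estimate is the conditional expectation of the posterior: 

\[
\begin{aligned}
\widehat{\theta}_{\text{MMSE}}(x) &= \mathbb{E}_{\Theta|X}[\Theta \,|\, X = x] \\
&= \int_0^\infty \theta f_{\Theta|X}(\theta|x) \, d\theta \\
&= \int_0^\infty \theta (\alpha + x)^2 \theta e^{-(\alpha + x)\theta} \, d\theta \\
&= (\alpha + x) \underbrace{\int_0^\infty \theta^2 (\alpha + x) e^{-(\alpha + x)\theta} \, d\theta}_{\text{2nd moment of exponential distribution}} \\
&= (\alpha + x) \cdot \frac{2}{(\alpha + x)^2} = \frac{2}{\alpha + x}.
\end{aligned}
\]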
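
The MAP estimate is the peak of the posterior: 

\[
\begin{aligned}
\widehat{\theta}_{\text{MAP}}(x) &= \operatorname*{argmax}_{\theta} \; \log f_{X|\Theta}(x|\theta) + \log f_{\Theta}(\theta) \\
&= \operatorname*{argmax}_{\theta} \; -\theta x + \log \theta - \alpha \theta + \log \alpha.
\end{aligned}
\]

Taking the derivative and setting it to zero yields $-x + \frac{1}{\theta} - \alpha = 0$. This implies that 

\[
\widehat{\theta}_{\text{MAP}}(x) = \frac{1}{\alpha + x}.
\]

Finally, the ML estimate is 

\[
\widehat{\theta}_{\text{ML}}(x) = \operatorname*{argmax}_{\theta} \; \log f_{X|\Theta}(x|\theta) = \frac{1}{x}.
\]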
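
The three closed-form answers of the example can be cross-checked on a grid. The sketch below assumes the exponential likelihood $\theta e^{-\theta x}$ and the exponential prior $\alpha e^{-\alpha\theta}$ implied by the MAP derivation above, with made-up values `alpha = 1.0` and `x = 2.0`; the grid range and resolution are also arbitrary choices. It normalizes the posterior numerically, then reads off the peaks and the mean.

```python
import numpy as np

# Assumed values for the prior rate and the single observation.
alpha, x = 1.0, 2.0

theta = np.linspace(1e-6, 40.0, 400001)
dtheta = theta[1] - theta[0]

likelihood = theta * np.exp(-theta * x)   # f_{X|Theta}(x | theta)
prior = alpha * np.exp(-alpha * theta)    # f_Theta(theta)
posterior = likelihood * prior
posterior /= posterior.sum() * dtheta     # divide by f_X(x), computed numerically

theta_ml = theta[np.argmax(likelihood)]          # compare with 1 / x
theta_map = theta[np.argmax(posterior)]          # compare with 1 / (alpha + x)
theta_mmse = (theta * posterior).sum() * dtheta  # compare with 2 / (alpha + x)
print(theta_ml, theta_map, theta_mmse)
```

Note that the MMSE estimate needs the normalization by $f_X(x)$, whereas the two peak-finding estimates would be unchanged without it.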


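Practice Exercise 8.8. Following the previous example, derive the estimates for 
multiple i.i.d. observations $X = x$, where $x = [x_1, \ldots, x_N]^T$. 


Solution. The posterior is 

\[
\begin{aligned}
f_{\Theta|X}(\theta|x) &= \frac{f_{X|\Theta}(x|\theta) f_{\Theta}(\theta)}{f_X(x)} \\
&= \frac{\left(\prod_{n=1}^N f_{X_n|\Theta}(x_n|\theta)\right) f_{\Theta}(\theta)}{f_X(x)} \\
&= \frac{\alpha \theta^N e^{-(\alpha + \sum_{n=1}^N x_n)\theta}}{\int_0^\infty \alpha \theta^N e^{-(\alpha + \sum_{n=1}^N x_n)\theta} \, d\theta} \\
&= \frac{\left(\alpha + \sum_{n=1}^N x_n\right)^{N+1}}{N!} \, \theta^N e^{-(\alpha + \sum_{n=1}^N x_n)\theta}.
\end{aligned}
\]

Therefore, compared with the single-observation case, $x$ is replaced by the sum $\sum_{n=1}^N x_n$ 
and the power of $\theta$ grows from 1 to $N$. Hence, the estimates are: 

\[
\widehat{\theta}_{\text{MMSE}}(x) = \frac{N + 1}{\alpha + \sum_{n=1}^N x_n}, \qquad
\widehat{\theta}_{\text{MAP}}(x) = \frac{N}{\alpha + \sum_{n=1}^N x_n}, \qquad
\widehat{\theta}_{\text{ML}}(x) = \frac{N}{\sum_{n=1}^N x_n}.
\]

This example shows that the ML estimate $\widehat{\theta}_{\text{ML}}(x)$ is the reciprocal of the sample mean. The 
posterior is an Erlang distribution, and therefore the peak is offset by $\alpha$ in the denominator. However, 
as $N \to \infty$ the posterior distribution is dominated by the likelihood, so the MAP estimate 
approaches the ML estimate. Finally, since the Erlang distribution is asymmetric, the mean is different from 
the peak. Hence, the MMSE estimate is different from the MAP estimate. 
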
8.4.4 MMSE estimator for multidimensional Gaussian 


The multidimensional Gaussian has some very important uses in data science. Accordingly, 
we devote this subsection to the discussion of the MMSE estimate of a Gaussian. The main 
result is stated as follows. 


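What is the MMSE estimator for a multi-dimensional Gaussian? 


Theorem 8.3. Suppose $\Theta \in \mathbb{R}^d$ and $X \in \mathbb{R}^N$ are jointly Gaussian with a joint PDF 

\[
\begin{bmatrix} \Theta \\ X \end{bmatrix} \sim \text{Gaussian}\left( \begin{bmatrix} \mu_{\Theta} \\ \mu_X \end{bmatrix}, \begin{bmatrix} \Sigma_{\Theta\Theta} & \Sigma_{\Theta X} \\ \Sigma_{X\Theta} & \Sigma_{XX} \end{bmatrix} \right).
\]

The MMSE estimator is 

\[
\widehat{\Theta}_{\text{MMSE}}(X) = \mu_{\Theta} + \Sigma_{\Theta X} \Sigma_{XX}^{-1} (X - \mu_X).
\]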


The proof of this result is not difficult but it is tedious. The flow of the argument is: 


• Step 1: Show that the posterior distribution $f_{\Theta|X}(\theta|x)$ is a Gaussian. 
• Step 2: To do so we need to complete the squares for matrices. 
• Step 3: Once we have $f_{\Theta|X}(\theta|x)$, the posterior mean is the MMSE estimator. 


The proof below can be skipped if this is your first time reading the book. 


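Proof. The posterior PDF is 

\[
\begin{aligned}
f_{\Theta|X}(\theta|x) &= \frac{f_{\Theta,X}(\theta, x)}{f_X(x)} \\
&= \frac{ \frac{1}{\sqrt{(2\pi)^{d+N} |\Sigma|}} \exp\left\{ -\frac{1}{2} \begin{bmatrix} \theta - \mu_{\Theta} \\ x - \mu_X \end{bmatrix}^T \begin{bmatrix} \Sigma_{\Theta\Theta} & \Sigma_{\Theta X} \\ \Sigma_{X\Theta} & \Sigma_{XX} \end{bmatrix}^{-1} \begin{bmatrix} \theta - \mu_{\Theta} \\ x - \mu_X \end{bmatrix} \right\} }{ \frac{1}{\sqrt{(2\pi)^N |\Sigma_{XX}|}} \exp\left\{ -\frac{1}{2} (x - \mu_X)^T \Sigma_{XX}^{-1} (x - \mu_X) \right\} },
\end{aligned}
\]

where $\Sigma$ denotes the joint covariance matrix of $\Theta$ and $X$. 
Without loss of generality, we assume that $\mu_X = \mu_{\Theta} = 0$. Then the posterior becomes 

\[
f_{\Theta|X}(\theta|x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma| / |\Sigma_{XX}|}} \exp\Bigg\{ \underbrace{ -\frac{1}{2} \begin{bmatrix} \theta \\ x \end{bmatrix}^T \begin{bmatrix} \Sigma_{\Theta\Theta} & \Sigma_{\Theta X} \\ \Sigma_{X\Theta} & \Sigma_{XX} \end{bmatrix}^{-1} \begin{bmatrix} \theta \\ x \end{bmatrix} + \frac{1}{2} x^T \Sigma_{XX}^{-1} x }_{H(\theta, x)} \Bigg\}.
\]

The tedious task here is to simplify $H(\theta, x)$. 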
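
Regardless of what the 2-by-2 block matrix inverse is, the matrix will take the form 

\[
\begin{bmatrix} \Sigma_{\Theta\Theta} & \Sigma_{\Theta X} \\ \Sigma_{X\Theta} & \Sigma_{XX} \end{bmatrix}^{-1} = \begin{bmatrix} A & B \\ C & D \end{bmatrix},
\]

for some choices of matrices $A$, $B$, $C$ and $D$. Therefore, the function $H(\theta, x)$ can be written 
as 

\[
H(\theta, x) = -\frac{1}{2} \left\{ \theta^T A \theta + \theta^T B x + x^T C \theta + x^T D x - x^T \Sigma_{XX}^{-1} x \right\}. \tag{8.72}
\]

Our goal is to complete the square for $H(\theta, x)$. To this end, we propose to write 

\[
H(\theta, x) = -\frac{1}{2} \left\{ (\theta - Gx)^T A (\theta - Gx) + Q(x) \right\}, \tag{8.73}
\]

for some matrix $G$ and function $Q(\cdot)$ of $x$ only. If we compare Equation (8.72) and 
Equation (8.73), we observe that $G$ must satisfy 

\[
G = -A^{-1} B.
\]

Therefore, if we can determine $A$ and $B$, we will know $G$. If we know $G$, we have completed 
the square for $H(\theta, x)$. If we can complete the square for $H(\theta, x)$, we can write 

\[
f_{\Theta|X}(\theta|x) = \underbrace{ \frac{\exp\{-Q(x)/2\}}{\sqrt{(2\pi)^d |\Sigma| / |\Sigma_{XX}|}} }_{\text{constant in } \theta} \; \underbrace{ \exp\left\{ -\frac{1}{2} (\theta - Gx)^T A (\theta - Gx) \right\} }_{\text{a Gaussian}}.
\]

Hence, the MMSE estimate, which is the posterior mean $\mathbb{E}[\Theta \,|\, X = x]$, is simply $Gx$: 

\[
\widehat{\theta}_{\text{MMSE}}(x) = \mathbb{E}[\Theta \,|\, X = x] = Gx.
\]

So it remains to determine $A$ and $B$ by solving the tedious matrix inversion problem. The 
result is:$^6$ 

\[
\begin{aligned}
A &= (\Sigma_{\Theta\Theta} - \Sigma_{\Theta X} \Sigma_{XX}^{-1} \Sigma_{X\Theta})^{-1}, \\
B &= -(\Sigma_{\Theta\Theta} - \Sigma_{\Theta X} \Sigma_{XX}^{-1} \Sigma_{X\Theta})^{-1} \Sigma_{\Theta X} \Sigma_{XX}^{-1}, \\
C &= -(\Sigma_{XX} - \Sigma_{X\Theta} \Sigma_{\Theta\Theta}^{-1} \Sigma_{\Theta X})^{-1} \Sigma_{X\Theta} \Sigma_{\Theta\Theta}^{-1}, \\
D &= (\Sigma_{XX} - \Sigma_{X\Theta} \Sigma_{\Theta\Theta}^{-1} \Sigma_{\Theta X})^{-1}.
\end{aligned}
\]

Therefore, plugging everything into the equation, 

\[
\begin{aligned}
\widehat{\theta}_{\text{MMSE}}(x) &= -A^{-1} B x \\
&= \Sigma_{\Theta X} \Sigma_{XX}^{-1} x.
\end{aligned}
\]

For non-zero means, we can repeat the same arguments above and show that 

\[
\widehat{\theta}_{\text{MMSE}}(x) = \mu_{\Theta} + \Sigma_{\Theta X} \Sigma_{XX}^{-1} (x - \mu_X).
\]

$^6$ See Matrix Cookbook https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf, Section 9.1.5, 
on the Schur complement. 

End of the proof. Please join us again. 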
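
The key step of the proof, $G = -A^{-1}B = \Sigma_{\Theta X}\Sigma_{XX}^{-1}$, is easy to verify numerically. The sketch below is an illustration under assumptions: the dimensions, the randomly generated joint covariance `Sigma`, the mean `mu`, and the helper name `mmse_estimate` are all hypothetical. It inverts the joint covariance, extracts the blocks $A$ and $B$, and compares $-A^{-1}B$ with the closed-form gain.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dimensions and a random positive definite joint covariance.
d, N = 2, 3
M = rng.normal(size=(d + N, d + N))
Sigma = M @ M.T + 0.5 * np.eye(d + N)
mu = rng.normal(size=d + N)

S_tx = Sigma[:d, d:]   # Sigma_{Theta X}
S_xx = Sigma[d:, d:]   # Sigma_{XX}

# Blocks A and B of the inverse joint covariance, and G = -A^{-1} B.
P = np.linalg.inv(Sigma)
A, B = P[:d, :d], P[:d, d:]
G = -np.linalg.solve(A, B)

# Closed-form gain Sigma_{Theta X} Sigma_{XX}^{-1} from the theorem.
gain = S_tx @ np.linalg.inv(S_xx)

def mmse_estimate(x):
    # mu_Theta + Sigma_{Theta X} Sigma_{XX}^{-1} (x - mu_X)
    return mu[:d] + gain @ (x - mu[d:])

x_obs = rng.normal(size=N)
print(np.allclose(G, gain), mmse_estimate(x_obs))
```

The estimator is affine in the observation, so once the gain matrix is computed, each new estimate costs only one matrix-vector product.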


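Practice Exercise 8.9. Suppose $\Theta \in \mathbb{R}^d$ and $X \in \mathbb{R}^N$ are jointly Gaussian with a 
joint PDF 

\[
\begin{bmatrix} \Theta \\ X \end{bmatrix} \sim \text{Gaussian}\left( \begin{bmatrix} \mu_{\Theta} \\ \mu_X \end{bmatrix}, \begin{bmatrix} \Sigma_{\Theta\Theta} & \Sigma_{\Theta X} \\ \Sigma_{X\Theta} & \Sigma_{XX} \end{bmatrix} \right).
\]

We know that the MMSE estimator is 

\[
\widehat{\Theta}_{\text{MMSE}}(X) = \mu_{\Theta} + \Sigma_{\Theta X} \Sigma_{XX}^{-1} (X - \mu_X). \tag{8.74}
\]

Find the mean squared error when using the MMSE estimator. 
Solution. Conditioned on $X = x$, according to Equation (8.70), the MSE is 

\[
\text{MSE}(\Theta, \widehat{\Theta}(X)) = \text{Tr}\left\{ \text{Cov}[\Theta \,|\, X] \right\}.
\]

The conditional covariance $\text{Cov}[\Theta \,|\, X]$ is the covariance of the posterior distribution 
$f_{\Theta|X}(\theta|x)$. Since the posterior is proportional to $\exp\{-\frac{1}{2}(\theta - Gx)^T A (\theta - Gx)\}$, its 
covariance is $A^{-1}$, so 

\[
\begin{aligned}
\text{Tr}\left\{ \text{Cov}[\Theta \,|\, X] \right\} &= \text{Tr}\left\{ A^{-1} \right\} \\
&= \text{Tr}\left\{ \Sigma_{\Theta\Theta} - \Sigma_{\Theta X} \Sigma_{XX}^{-1} \Sigma_{X\Theta} \right\}.
\end{aligned}
\]

The overall mean squared error is 

\[
\begin{aligned}
\text{MSE}(\Theta, \widehat{\Theta}(\cdot)) &= \mathbb{E}_X\left[ \text{MSE}(\Theta, \widehat{\Theta}(X)) \right] \\
&= \int \text{MSE}(\Theta, \widehat{\Theta}(x)) f_X(x) \, dx \\
&= \int \text{Tr}\left\{ \text{Cov}[\Theta \,|\, X = x] \right\} f_X(x) \, dx \\
&= \int \text{Tr}\left\{ \Sigma_{\Theta\Theta} - \Sigma_{\Theta X} \Sigma_{XX}^{-1} \Sigma_{X\Theta} \right\} f_X(x) \, dx \\
&= \text{Tr}\left\{ \Sigma_{\Theta\Theta} - \Sigma_{\Theta X} \Sigma_{XX}^{-1} \Sigma_{X\Theta} \right\} \int f_X(x) \, dx \\
&= \text{Tr}\left\{ \Sigma_{\Theta\Theta} - \Sigma_{\Theta X} \Sigma_{XX}^{-1} \Sigma_{X\Theta} \right\}.
\end{aligned}
\]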


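For multidimensional Gaussian, does MMSE = MAP? 
The answer is YES. 


Theorem 8.4. Suppose $\Theta \in \mathbb{R}^d$ and $X \in \mathbb{R}^N$ are jointly Gaussian with a joint PDF 

\[
\begin{bmatrix} \Theta \\ X \end{bmatrix} \sim \text{Gaussian}\left( \begin{bmatrix} \mu_{\Theta} \\ \mu_X \end{bmatrix}, \begin{bmatrix} \Sigma_{\Theta\Theta} & \Sigma_{\Theta X} \\ \Sigma_{X\Theta} & \Sigma_{XX} \end{bmatrix} \right).
\]

The MAP estimate is 

\[
\widehat{\Theta}_{\text{MAP}}(X) = \mu_{\Theta} + \Sigma_{\Theta X} \Sigma_{XX}^{-1} (X - \mu_X).
\]

Proof. The proof of this result is straightforward. If we return to the proof of the MMSE 
result, we note that 

\[
f_{\Theta|X}(\theta|x) = \underbrace{ \frac{\exp\{-Q(x)/2\}}{\sqrt{(2\pi)^d |\Sigma| / |\Sigma_{XX}|}} }_{\text{constant in } \theta} \; \underbrace{ \exp\left\{ -\frac{1}{2} (\theta - Gx)^T A (\theta - Gx) \right\} }_{\text{a Gaussian}}.
\]

Therefore, the maximizer of this posterior distribution, which is the MAP estimate, is 

\[
\begin{aligned}
\widehat{\theta}_{\text{MAP}}(x) &= \operatorname*{argmax}_{\theta} \; f_{\Theta|X}(\theta|x) \\
&= \operatorname*{argmax}_{\theta} \; -\frac{1}{2} (\theta - Gx)^T A (\theta - Gx).
\end{aligned}
\]

Taking the derivative w.r.t. $\theta$ and setting it to zero, we have 

\[
\widehat{\theta}_{\text{MAP}}(x) = Gx = \Sigma_{\Theta X} \Sigma_{XX}^{-1} x.
\]

If the mean vectors are non-zero, we have $\widehat{\theta}_{\text{MAP}}(x) = \mu_{\Theta} + \Sigma_{\Theta X} \Sigma_{XX}^{-1} (x - \mu_X)$. 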


8.4.5 Linking MMSE and neural networks 


The blossoming of deep neural networks since 2010 has created a substantial impact on 
modern data science. The basic idea of a neural network is to train a stack of matrices and 
nonlinear functions (known as the network weights and the neuron activation functions, 
respectively), among other innovative ideas, so that a certain training loss is minimized. 
Expressing this by equations, the goal of the learning is equivalent to solving the optimization 
problem 

\[
\widehat{g}(\cdot) = \operatorname*{argmin}_{g(\cdot)} \; \mathbb{E}_{X,\Theta}\left[ \|\Theta - g(X)\|^2 \right], \tag{8.76}
\]

where $X \in \mathbb{R}^M$ is the input data and $\Theta \in \mathbb{R}^d$ is the ground truth prediction. We want to 
find $g(\cdot)$ such that the error is minimized. 

The error we choose here is the $\ell_2$-norm error $\|\cdot\|^2$. It is only one of many possible 
choices. You may recognize that this is exactly the same as the MMSE optimization. 
Therefore, the neural network we are finding here is the MMSE estimator. Since the MMSE 
estimator is the conditional expectation of the posterior distribution, the neural network 
approximates the mean of the posterior distribution. 

Often the struggle we have with deep neural networks is whether we can find the 
optimal network parameters via optimization algorithms such as the stochastic gradient 
descent algorithms. However, if we think about this problem more deeply, the equivalence 
between the MMSE estimator and the posterior mean tells us that the hard part is related 
to the posterior distribution. In the high-dimensional landscape, it is close to impossible to 
determine the posterior and its mean. If we add to these difficulties the nonconvexity 
of the function $g$, training a network is very challenging. 

One misconception about neural networks is that if we can achieve a low training error, 
and if the model can also achieve a low testing error, then the network is good. This is a false 
sense of satisfaction. If a model can achieve very good training and testing errors, then the 
model is only good with respect to the error you choose. For example, if we choose the 
$\ell_2$-norm error $\|\cdot\|^2$ and if our model achieves good training and testing errors (in terms of $\|\cdot\|^2$), 
we can conclude that the model does well with respect to $\|\cdot\|^2$. The more serious problem 
here, unfortunately, is that $\|\cdot\|^2$ is not necessarily a good metric of performance (for both 
training and testing) because training with $\|\cdot\|^2$ is equivalent to approximating the posterior 
mean. There is absolutely no reason to believe that in the high-dimensional landscape, the 
posterior mean is optimal. If we choose the posterior mode or the posterior median, 
we will also obtain a result. Why are the modes and medians “worse” than the mean? 
In practice, it has been observed that training deep neural networks for image-processing 
tasks generally leads to over-smoothed images. This demonstrates how minimizing the mean 
squared error $\|\cdot\|^2$ can be a fundamental mismatch with the problem. 


Is minimizing the MSE the best option? 


• No. Minimizing the MSE is equivalent to finding the mean of the posterior. There 
is no reason why the mean is the “best”. 


• You can find the mode of the posterior, in which case you will get a MAP 
estimator. 


• You can also find the median of the posterior, in which case you will get the 
minimum absolute error estimator. 


• Ultimately, you need to define what is “good” and what is “bad”. 


• The same principle applies to deep neural networks. Especially in the regression 
setting, why is $\|\cdot\|^2$ a good evaluation metric for testing (not just training)? 
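
The dependence of the "best" estimate on the chosen error can be seen in a few lines. In the sketch below, draws from an exponential distribution serve as a stand-in for samples from an asymmetric posterior; this choice, the sample size, and the grid are assumptions made purely for illustration. A constant estimate `c` is tuned by grid search under the squared error and under the absolute error, and the two winners are compared with the sample mean and the sample median.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for draws from an asymmetric posterior (an assumption for illustration).
samples = rng.exponential(scale=1.0, size=20_000)

# Candidate constant estimates and their empirical losses.
cs = np.linspace(0.0, 3.0, 601)
sq_loss = np.array([np.mean((samples - c) ** 2) for c in cs])
abs_loss = np.array([np.mean(np.abs(samples - c)) for c in cs])

c_sq = cs[np.argmin(sq_loss)]    # minimizer of the mean squared error
c_abs = cs[np.argmin(abs_loss)]  # minimizer of the mean absolute error
print(c_sq, samples.mean())
print(c_abs, np.median(samples))
```

The squared error picks (approximately) the mean and the absolute error picks (approximately) the median, so the two criteria disagree whenever the distribution is skewed.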


8.5 Summary 


In this chapter, we have discussed the basic principles of parameter estimation. The three 
building blocks are: 


• Likelihood $f_{X|\Theta}(x|\theta)$: the PDF that we observe samples $X$ conditioned on the 
unknown parameter $\Theta$. In the frequentist world, $\Theta$ is a deterministic quantity. In the 
Bayesian world, $\Theta$ is random and so it has a PDF. 


• Prior $f_{\Theta}(\theta)$: the PDF of $\Theta$. The prior $f_{\Theta}(\theta)$ is used by all Bayesian computation. 


• Posterior $f_{\Theta|X}(\theta|x)$: the PDF that the underlying parameter is $\Theta = \theta$ given that we 
have observed $X = x$. 


The three building blocks give us several strategies to estimate the parameters: 


• Maximum likelihood (ML) estimation: Maximize $f_{X|\Theta}(x|\theta)$. 


• Maximum a posteriori (MAP) estimation: Maximize $f_{\Theta|X}(\theta|x)$. 


• Minimum mean-square estimation (MMSE): Minimize the mean squared error, which 
is equivalent to finding the mean of $f_{\Theta|X}(\theta|x)$. 


As discussed in this chapter, no single estimation strategy is universally “better” because 
one needs to specify the optimality criterion. If the goal is to minimize the mean squared 
error, then the MMSE estimator is the optimal strategy. If the goal is to maximize the 
likelihood without assuming any prior knowledge, the ML estimator would be the optimal 
strategy. It may appear that if we knew the ground truth parameter $\theta^*$ we could minimize 
the distance between the estimated parameter $\widehat{\theta}$ and the true value $\theta^*$. If the parameter 
is a scalar, this will work. However, if the parameter is a vector, the choice of norm for the distance 
becomes an issue. For example, if one cares about the mean absolute error (MAE), the 
optimal estimator would be the median of the posterior distribution instead of the mean of 
the posterior in the MMSE case. Therefore, it is the end user’s responsibility to specify the 
optimality criterion. 

Whenever we consider parameter estimation, we tend to think that it is about estimat- 
ing the model parameters, such as the mean of a Gaussian PDF. While in many statistics 
problems this is indeed the case, parameter estimation can be much broader if we link it 
with regression. Specifically, a regularized linear regression problem can be formulated as a 
MAP estimation 


\[
\theta^* = \operatorname*{argmin}_{\theta} \;\; \underbrace{\|X\theta - y\|^2}_{-\log f_{Y|\Theta}(y|\theta)} + \underbrace{\lambda R(\theta)}_{-\log f_{\Theta}(\theta)}, \tag{8.77}
\]

for some regularization $R(\theta)$, which is also the negative log of the prior. Expressed in this 
way, we recognize that the MAP estimation can be used to recover signals. For example, we 
can model $X$ as a linear degradation process of certain imaging systems. Then solving the 
MAP estimation is equivalent to finding the best signal explaining the degraded observation 
using the posterior as the criterion. There is rich literature dealing with solving MAP 
estimation problems similar to these in subjects such as computational imaging, communication 
systems, remote sensing, radar engineering, and recommendation systems, to name a few. 

8.7 Problems 


Exercise 1. 
Let $X_1, \ldots, X_N$ be a sequence of i.i.d. Bernoulli random variables with $\mathbb{P}[X_n = 1] = \theta$. 
Suppose that we have observed $x_1, \ldots, x_N$. 


(a) Show that the PMF of $X_n$ is $p_{X_n}(x_n \,|\, \theta) = \theta^{x_n} (1 - \theta)^{1 - x_n}$. Find the joint PMF 
$p_{X_1, \ldots, X_N}(x_1, \ldots, x_N)$. 


(b) Find the maximum likelihood estimate $\widehat{\theta}_{\text{ML}}$, i.e., 

\[
\widehat{\theta}_{\text{ML}} = \operatorname*{argmax}_{\theta} \; \log p_{X_1, \ldots, X_N}(x_1, \ldots, x_N).
\]

Express your answer in terms of $x_1, \ldots, x_N$. 


(c) Let $\theta = 1/2$. Use Chebyshev’s inequality to find an upper bound for $\mathbb{P}[|\widehat{\Theta}_{\text{ML}} - \theta| > 0.1]$. 

Exercise 2. 
Let $Y_n = \theta + W_n$ be the output of a noisy channel where the input is a scalar $\theta$ and 
$W_n \sim \mathcal{N}(0, 1)$ is i.i.d. Gaussian noise. Suppose that we have observed $y_1, \ldots, y_N$. 


(a) Express the PDF of $Y_n$ in terms of $\theta$ and $y_n$. Find the joint PDF of $Y_1, \ldots, Y_N$. 


(b) Find the maximum likelihood estimate $\widehat{\theta}_{\text{ML}}$. Express your answer in terms of $y_1, \ldots, y_N$. 


(c) Find $\mathbb{E}[\widehat{\Theta}_{\text{ML}}]$. 


Exercise 3. 
Let $X_1, \ldots, X_N$ be a sequence of i.i.d. Gaussian random variables with unknown mean $\theta_1$ 
and variance $\theta_2$. Suppose that we have observations $x_1, \ldots, x_N$. 


(a) Express the PDF of $X_n$ in terms of $x_n$, $\theta_1$ and $\theta_2$. Find the joint PDF of $X_1, \ldots, X_N$. 


(b) Find the maximum likelihood estimates of $\theta_1$ and $\theta_2$. 


Exercise 4. 
In this problem we study a single-photon image sensor. First, recall that photons arrive 
according to a Poisson distribution, i.e., the probability of observing $k$ photons is 

\[
\mathbb{P}[Y = k] = \frac{\lambda^k e^{-\lambda}}{k!},
\]

where $\lambda$ is the (unknown) underlying photon arrival rate. When photons arrive at the single-photon 
detector, the detector generates a binary response “1” when one or more photons 
are detected, and “0” when no photon is detected. 


(a) Let $B$ be the random variable denoting the response of the single-photon detector. 
That is, 

\[
B = \begin{cases} 1, & Y \ge 1, \\ 0, & Y = 0. \end{cases}
\]

Find the PMF of $B$. 


(b) Suppose we have obtained $T$ independent measurements with realizations $B_1 = b_1$, 
$B_2 = b_2$, ..., $B_T = b_T$. Show that the underlying photon arrival rate $\lambda$ can be estimated 
by 

\[
\widehat{\lambda} = -\log\left( 1 - \frac{1}{T} \sum_{t=1}^{T} b_t \right).
\]

(c) Get a random image from the internet and turn it into a grayscale array with values 
between 0 and 1. Write a MATLAB or Python program to synthetically generate a 
sequence of $T = 1000$ binary images. Then use the previous result to reconstruct the 
grayscale image. 

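
Exercise 5. 
Consider a deterministic vector $s \in \mathbb{R}^d$ and random vectors 

\[
\begin{aligned}
f_{Y|\Theta}(y|\theta) &= \text{Gaussian}(s\theta, \Sigma), \\
f_{\Theta}(\theta) &= \text{Gaussian}(\mu, \sigma^2).
\end{aligned}
\]

(a) Show that the posterior distribution is given by 

\[
f_{\Theta|Y}(\theta|y) = \text{Gaussian}(m, q^2), \tag{8.78}
\]

where 

\[
d^2 = s^T \Sigma^{-1} s, \qquad m = \left( d^2 + \frac{1}{\sigma^2} \right)^{-1} \left( s^T \Sigma^{-1} y + \frac{\mu}{\sigma^2} \right), \qquad q^2 = \frac{1}{d^2 + \frac{1}{\sigma^2}}.
\]

(b) Show that the MMSE estimate $\widehat{\theta}_{\text{MMSE}}(y)$ is given by 

\[
\widehat{\theta}_{\text{MMSE}}(y) = \frac{s^T \Sigma^{-1} y + \frac{\mu}{\sigma^2}}{d^2 + \frac{1}{\sigma^2}}. \tag{8.79}
\]

(c) Show that the MSE is given by 

\[
\text{MSE}(\Theta, \widehat{\theta}_{\text{MMSE}}(Y)) = \frac{1}{d^2 + \frac{1}{\sigma^2}}. \tag{8.80}
\]

What happens when $\sigma \to 0$? 
(d) Give an interpretation of $d^2$. What happens when $d^2 \to 0$ and when $d^2 \to \infty$? 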
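Exercise 6.
Prove the following identity:

$$\begin{bmatrix} \boldsymbol{\Sigma}_{\Theta\Theta} & \boldsymbol{\Sigma}_{\Theta X} \\ \boldsymbol{\Sigma}_{X\Theta} & \boldsymbol{\Sigma}_{XX} \end{bmatrix}^{-1}
= \begin{bmatrix}
(\boldsymbol{\Sigma}_{\Theta\Theta} - \boldsymbol{\Sigma}_{\Theta X}\boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{X\Theta})^{-1} & -(\boldsymbol{\Sigma}_{\Theta\Theta} - \boldsymbol{\Sigma}_{\Theta X}\boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{X\Theta})^{-1}\boldsymbol{\Sigma}_{\Theta X}\boldsymbol{\Sigma}_{XX}^{-1} \\
-(\boldsymbol{\Sigma}_{XX} - \boldsymbol{\Sigma}_{X\Theta}\boldsymbol{\Sigma}_{\Theta\Theta}^{-1}\boldsymbol{\Sigma}_{\Theta X})^{-1}\boldsymbol{\Sigma}_{X\Theta}\boldsymbol{\Sigma}_{\Theta\Theta}^{-1} & (\boldsymbol{\Sigma}_{XX} - \boldsymbol{\Sigma}_{X\Theta}\boldsymbol{\Sigma}_{\Theta\Theta}^{-1}\boldsymbol{\Sigma}_{\Theta X})^{-1}
\end{bmatrix}.$$

Hint: You can perform reverse engineering by checking whether the product of the left-hand
side and the right-hand side would give you the identity matrix.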


Exercise 7. 

Let $X_1$, $X_2$, $X_3$ and $X_4$ be four i.i.d. Poisson random variables with mean $\theta = 4$. Find the
mean and variance of the following estimators $\widehat{\Theta}(\boldsymbol{X})$ for $\theta$ and determine whether they are
biased or unbiased.

• $\widehat{\Theta}(\boldsymbol{X}) = (X_1 + X_2)/2$

• $\widehat{\Theta}(\boldsymbol{X}) = (X_3 + X_4)/2$

• $\widehat{\Theta}(\boldsymbol{X}) = (X_1 + 2X_2)/3$

• $\widehat{\Theta}(\boldsymbol{X}) = (X_1 + X_2 + X_3 + X_4)/4$


Exercise 8. 
Let $X_1,\ldots,X_N$ be i.i.d. random variables with a uniform distribution on $[0, \theta]$. Consider the
following estimator:

$$\widehat{\Theta}(\boldsymbol{X}) = \max(X_1,\ldots,X_N). \qquad (8.81)$$


(a) Show that the PDF of $\widehat{\Theta}$ is $f_{\widehat{\Theta}}(x) = N[F_X(x)]^{N-1} f_X(x)$, where $f_X$ and $F_X$ are
respectively the PDF and CDF of $X_n$.


(b) Show that $\widehat{\Theta}$ is a biased estimator.

(c) Find the variance of $\widehat{\Theta}$. Is it a consistent estimator?


(d) Find a constant $c$ so that $c\widehat{\Theta}$ is unbiased.


Exercise 9. 
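Let $X_1,\ldots,X_N$ be i.i.d. Gaussian random variables with unknown mean $\theta$ and known
variance $\sigma^2 = 1$.


(a) Show that the log-likelihood function is

$$\log \mathcal{L}(\theta \,|\, \boldsymbol{x}) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\sum_{n=1}^{N}(x_n - \theta)^2. \qquad (8.82)$$


(b) Let $\overline{X^2} = \frac{1}{N}\sum_{n=1}^{N} x_n^2$ and $\overline{X} = \frac{1}{N}\sum_{n=1}^{N} x_n$. Show that $\overline{X^2} \ge (\overline{X})^2$ if and only if
$\sum_{n=1}^{N}(x_n - \theta)^2 \ge 0$ for all $\theta$.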


(c) Use Python to plot the function $\log \mathcal{L}(\theta \,|\, \boldsymbol{x})$, when $\overline{X} = 2$ and $\overline{X^2} = 1$.


Exercise 10. 
Let $X_1,\ldots,X_N$ be i.i.d. uniform random variables over the interval $[0,\theta]$.
Let $T = \max(X_1,\ldots,X_N)$.


(a) Consider the estimator h(X) = 4 yoy Xn. Is h(-) an unbiased estimator? 


(b) Consider the estimator 9(X) = + So; X». Is g(-) an unbiased estimator? 


a(x) =a = (“F*) 


(c) Show that 
(d) Let g(X) = E[g(X)|T] = (#4) T. Show that


Cae ane 
Svan as 


LHX) = ( 


(e) Show that 


(aX) - 8)" = (sare) 


Exercise 11. 
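The Kullback-Leibler divergence between two distributions $p_1(\boldsymbol{x})$ and $p_2(\boldsymbol{x})$ is defined as

$$\text{KL}(p_1 \,\|\, p_2) = \int p_1(\boldsymbol{x}) \log\frac{p_1(\boldsymbol{x})}{p_2(\boldsymbol{x})}\, d\boldsymbol{x}. \qquad (8.83)$$

Suppose we approximate $p_1$ using a distribution $p_2$. Let us choose $p_2 = \text{Gaussian}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
Show that $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, which minimize the KL divergence, are such that

$$\boldsymbol{\mu} = \mathbb{E}_{\boldsymbol{x}\sim p_1(\boldsymbol{x})}[\boldsymbol{x}] \quad\text{and}\quad \boldsymbol{\Sigma} = \mathbb{E}_{\boldsymbol{x}\sim p_1(\boldsymbol{x})}\left[(\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^T\right].$$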


Exercise 12. 


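(a) Recall that the trace operator is defined as $\text{tr}[\boldsymbol{A}] = \sum_{i=1}^{d} [\boldsymbol{A}]_{ii}$. Prove the matrix
identity

$$\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x} = \text{tr}[\boldsymbol{A}\boldsymbol{x}\boldsymbol{x}^T], \qquad (8.84)$$

where $\boldsymbol{A} \in \mathbb{R}^{d\times d}$.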


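(b) Show that the likelihood function

$$p(\mathcal{D}\,|\,\boldsymbol{\Sigma}) = \prod_{n=1}^{N} \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\boldsymbol{x}_n - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}_n - \boldsymbol{\mu})\right\} \qquad (8.85)$$

can be written as

$$p(\mathcal{D}\,|\,\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{Nd/2}}|\boldsymbol{\Sigma}^{-1}|^{N/2} \exp\left\{-\frac{1}{2}\text{tr}\left[\boldsymbol{\Sigma}^{-1}\sum_{n=1}^{N}(\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^T\right]\right\}. \qquad (8.86)$$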


(c) Let A = Se sig, and )j,...,Aq be the eigenvalues of A. Show that the result from 
part (b) leads to 


d N/2 d 
: at Sod (8.87) 
p(D|&) = = Xi expe. = ‘er 
(2a) *4? [Syn |NC =1 2 i=1 


Hint: For matrix $\boldsymbol{A}$ with eigenvalues $\lambda_1,\ldots,\lambda_d$, $\text{tr}[\boldsymbol{A}] = \sum_{i=1}^{d}\lambda_i$.


(d) Find $\lambda_1,\ldots,\lambda_d$ such that Equation (8.87) is maximized.
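(e) With the choice of $\lambda_i$ given in (d), derive the ML estimate $\widehat{\boldsymbol{\Sigma}}_{\text{ML}}$.


(f) What would be the alternative way of finding $\widehat{\boldsymbol{\Sigma}}_{\text{ML}}$? You do not need to prove it. Just
briefly describe the idea.


(g) $\widehat{\boldsymbol{\Sigma}}_{\text{ML}}$ is a biased estimate of the covariance matrix because $\mathbb{E}[\widehat{\boldsymbol{\Sigma}}_{\text{ML}}] \neq \boldsymbol{\Sigma}$. Can you
suggest an unbiased estimate $\widehat{\boldsymbol{\Sigma}}_{\text{unbias}}$ such that $\mathbb{E}[\widehat{\boldsymbol{\Sigma}}_{\text{unbias}}] = \boldsymbol{\Sigma}$? You don’t need to
prove it. Just state the result.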
Chapter 9


Confidence and Hypothesis 


In Chapters 7 and 8 we learned about regression and estimation, which allow us to determine 
the underlying parameters of our statistical models. After obtaining the estimates, we would 
like to quantify the accuracy of the estimates and draw statistical conclusions. Additionally, 
we would like to understand the confidence of these estimates along with their statistical 
significance. This chapter presents a few principles that involve analyzing the confidence of 
the estimates and conducting hypothesis testing. There are two main questions that we will 
address: 


• How good is our estimate? This is a fundamental question about the estimator $\widehat{\Theta}$, a
random variable with a PDF, a mean, and a variance.¹ The estimator we construct
today may be different from the estimator we construct tomorrow due to variations in
the observed data. Therefore, the quality of the estimator depends on the randomness
and the number of samples used to construct it. To measure the quality of the estimator
we need to introduce an important concept known as the confidence.


• Is there statistical significance? Suppose that we ran a campaign and observed that
there is a change in the statistics. On what basis do we claim that the change is
statistically significant? How should the cutoff be determined? If we claim that a
result is statistically significant but there is no significance in reality, how much error
will we suffer? These questions are the subjects of hypothesis testing.


These two principal questions are critical for modern data science. If they are not properly 
answered, our statistical conclusions could potentially be flawed. A toy example: 

Imagine that you are developing a COVID-19 vaccine. You test the vaccine on three
patients, and all of them show positive responses to the vaccine. You feel excited because
your vaccine has a 100% success rate. You submit your vaccine application to the FDA. Within
1 second your application is rejected. Why? The answer is obvious. You only have three
testing samples. How reliable can these three samples be?

While you are laughing at this toy example, it raises deep statistical questions. First, 
why are three samples not enough? Well, it is because the variance of the estimator can 
potentially be huge. More samples are better because if the estimator is the sample average of 
the individual responses, the estimator will behave like a Gaussian according to the Central
Limit Theorem. The variance of this Gaussian will diminish as we have more samples.
Therefore, if we want to control the variance of the estimator, we need more samples. Second,
even if we have many samples, how confident is this estimator with respect to the unknown
population parameter? Note that the population parameter is unknown, and so we cannot
measure things such as the mean squared error. We need a tool to report confidence. Third,
for simple estimators such as the sample average, we can approximate it by a Gaussian.
However, if the estimator is more complicated, e.g., the sample median, how do we estimate
the variance and the confidence? Fourth, suppose that we have expanded the vaccine test
to, say, 951 patients, and we have obtained some statistics. To what extent can we declare
that the vaccine is effective? We need a decision rule that turns the statistics into a binary
decision. Finally, even if we declare that the vaccine is effective with a confidence of 95%,
what about the remaining 5%? What if we want to push the confidence to 99%? What is
the trade-off?

¹ Not all random variables have a well-defined PDF, mean, and variance. E.g., a Cauchy variable does
not have a mean.

As you can see, these questions are the recurring themes of all data science problems. 
No matter if you are developing a medical diagnostic system, a computer vision algorithm, 
a speech recognition system, a recommendation system, a search engine, stock forecast, 
fraud detection, or robotics controls, you need to answer these questions. This chapter will 
introduce useful concepts related to data analysis in the form of five basic principles: 


1. Confidence interval (Section 9.1). A confidence interval is a random interval that
includes the true parameter with a prescribed probability. We will discuss how a
confidence interval is constructed and the correct way to interpret the confidence interval.


2. Bootstrapping (Section 9.2). When constructing the confidence interval, we need the 
variance of the estimator. However, since we do not know the true distribution, we 
need an alternative way to estimate the variance. Bootstrapping is designed for this 
purpose. 


3. Hypothesis testing (Section 9.3). Many statistical tasks require a binary decision at 
the end, e.g., there is a disease versus there is no disease. Hypothesis testing is a 
principle for making a systematic decision with statistical guarantees. 


4. Neyman-Pearson decision (Section 9.4). The simple hypothesis testing procedure has 
many limitations that can only be resolved if we understand a more general framework. 
We will study such a framework, called the Neyman-Pearson decision rule. 


5. ROC and PR curves (Section 9.5). No decision rule is perfect. There is always a 
trade-off between how much we can detect and how much we will miss. The receiver 
operating characteristic (ROC) curve and the precision-recall (PR) curve can give us 
more insight into this trade-off. We will establish the equivalence between the ROC 
and the PR curve and correct any misconceptions about them. 


After reading this chapter, we hope that you will be able to apply these principles 
to your favorite data analysis problems correctly. With these principles, you can tell your 
customers or bosses the statistical significance of your conclusions. You will also be able to 
help your friends understand the many misconceptions that they may find on the internet. 
9.1 Confidence Interval


The first topic we discuss in this chapter is the confidence interval. At a high level, the 
confidence interval tells us the quality of our estimator with respect to the number of sam- 
ples. We begin this section by reviewing the randomness of an estimator. Then we develop 
the concept of the confidence interval. We discuss several methods for constructing and 
interpreting these confidence intervals. 


9.1.1 The randomness of an estimator 


Imagine that we have a dataset $\mathcal{X} = \{X_1,\ldots,X_N\}$, where we assume that $X_n$ are i.i.d.
copies drawn from a distribution $f_X(x;\theta)$. We want to construct an estimator $\widehat{\Theta}$ of $\theta$ from
the dataset $\mathcal{X}$. For example, if $f_X$ is a Gaussian distribution with an unknown mean $\theta$, we
would like to estimate $\theta$ using the sample average $\widehat{\Theta}$. In statistics, an estimator $\widehat{\Theta}$ is also
known as a statistic, which is constructed from the samples. In this book we use the terms
“estimator” and “statistic” interchangeably. Written as equations, an estimator is a function
of the samples:

$$\underbrace{\widehat{\Theta}}_{\text{estimator}} = \underbrace{g(X_1,\ldots,X_N)}_{\text{function of } \mathcal{X}},$$

where $g$ is a function that takes the samples $X_1,\ldots,X_N$ and returns a random variable $\widehat{\Theta}$.
For example, the sample average

$$\widehat{\Theta} = \frac{1}{N}\sum_{n=1}^{N} X_n$$

is an estimator because it is computed by summing the samples $X_1,\ldots,X_N$ and dividing the sum
by $N$.


What is an estimator? 


• An estimator $\widehat{\Theta}$ is a function of the samples $X_1,\ldots,X_N$:

$$\widehat{\Theta} = g(X_1,\ldots,X_N).$$

• $\widehat{\Theta}$ is a random variable. It has a PDF, CDF, mean, variance, etc.


By construction, $\widehat{\Theta}$ is a random variable because it is a function of the random samples.
Therefore, $\widehat{\Theta}$ has its own PDF, CDF, mean, variance, etc. Since $\widehat{\Theta}$ is a random variable,
we should report both the estimator’s value and the estimator’s confidence when reporting
its performance. The confidence measures the quality of $\widehat{\Theta}$ when compared to the true
parameter $\theta$. It provides a measure of the reliability of the estimator $\widehat{\Theta}$. If $\widehat{\Theta}$ fluctuates a
great deal we may not be confident of our estimates. Let’s consider the following example.
Example 9.1. A class of 1000 students took a test. The distribution of the scores is
roughly a Gaussian with mean 50 and standard deviation 20. A teaching assistant
was too lazy to calculate the true population mean. Instead, he sampled a subset of 5
scores listed as follows:

Student ID    1    2    3    4    5
Scores       11   97    1   78   82

He calculated the average, which is 53.8. This is a very good estimate of the class
average (which is 50). What is wrong with his procedure?


Solution. He was just lucky. It is quite possible that if he sampled another 5 scores, he
would get something very different. For example, if he looks at the scores of students 11
to 15, he could get:

Student ID   11   12   13   14   15
Scores       44   29   19   27   15

In this case the average is 26.8.

Both 53.8 and 26.8 are legitimate estimates, but they are random realizations
of a random variable $\widehat{\Theta}$. This $\widehat{\Theta}$ has a PDF, CDF, mean, variance, etc. It may be
misleading to simply report the estimated value from a particular instance, so the
confidence of the estimator must be specified.
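
The effect described in Example 9.1 can be reproduced numerically. The sketch below is only an illustration under assumptions that are not part of the example: the 1000 scores are synthesized from a Gaussian with mean 50 and standard deviation 20, the random seed is arbitrary, and the helper name `sample_average` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed (assumption)
scores = rng.normal(50, 20, size=1000)   # synthetic class of 1000 scores (assumption)

def sample_average(population, n, rng):
    """Average of n scores drawn without replacement from the population."""
    return rng.choice(population, size=n, replace=False).mean()

# Repeat the teaching assistant's procedure many times, 5 scores each time
estimates = np.array([sample_average(scores, 5, rng) for _ in range(2000)])

print("population mean      :", scores.mean())
print("first five estimates :", np.round(estimates[:5], 1))
print("std of the estimates :", estimates.std())   # roughly 20/sqrt(5)
```

Each entry of `estimates` is one realization of the estimator. The spread of these realizations, not any single value, is what the confidence of the estimator is meant to capture.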


Distributions of $\widehat{\Theta}$. We next discuss the distribution of $\widehat{\Theta}$. Figure 9.1 illustrates several
key ideas. Suppose that the population distribution $f_X(x)$ is a mixture of two Gaussians.
Let $\theta$ be the mean of this distribution (somewhere between the two peak locations). We
sample $N = 50$ data points $X_1,\ldots,X_N$ from this distribution. However, the 50 data points
we sample today could differ from the 50 data points we sample tomorrow. If we compute
the sample average from each of these finite-sample distributions, we will obtain a set of
sample averages $\widehat{\Theta}$. Notably, we have a set of $\widehat{\Theta}$ because today we have one $\widehat{\Theta}$ and tomorrow
we have another $\widehat{\Theta}$. By plotting the histogram of the sample averages $\widehat{\Theta}$, we will have a
distribution.

The histogram of $\widehat{\Theta}$ depends on several factors. According to the Central Limit Theorem,
the shape of $f_{\widehat{\Theta}}(\theta)$ is approximately a Gaussian because $\widehat{\Theta}$ is the average of $N$ i.i.d. random variables.
If $\widehat{\Theta}$ is not the average of i.i.d. random variables, the shape is not necessarily a Gaussian.
This results in additional complications, so we will discuss some tools for dealing with this
problem. The spread of the sample distribution is mainly driven by the number of samples
we have in each subdataset. As you can imagine, the more samples we have in a subdataset
the more accurately the subdataset represents the population. Thus you will have a more accurate sample average. The
fluctuation of the sample average will also be smaller.

Before we continue, let’s summarize the randomness of $\widehat{\Theta}$:


What is the randomness of $\widehat{\Theta}$?


• $\widehat{\Theta}$ is generated from a finite-sample dataset. Each time we draw a finite-sample
dataset, we introduce randomness.


• If $\widehat{\Theta}$ is the sample average, the PDF is (roughly) a Gaussian. If $\widehat{\Theta}$ is not a sample
average, the PDF is not necessarily a Gaussian.


• The spread of the fluctuation depends on the number of samples in each subdataset.


Figure 9.1: Pictorial illustration of the randomness of the estimator $\widehat{\Theta}$. Given a population, our datasets
are usually a subset of the population. Computing the sample average from these finite-sample distributions
introduces the randomness to $\widehat{\Theta}$. If we plot the histogram of the sample averages, we will obtain
a distribution. The mean of this distribution is the population mean, but there is a nontrivial amount of
fluctuation. The purpose of the concept of confidence interval is to quantify this fluctuation.
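
The three points above can be checked with a small simulation. This is a minimal sketch, not the experiment behind Figure 9.1: the mixture weights and component parameters, the sample sizes, the number of trials, and the helper name `draw_mixture` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed (assumption)

def draw_mixture(n, rng):
    """Draw n samples from 0.5*Gaussian(-2,1) + 0.5*Gaussian(3,1) (illustrative choice)."""
    from_second = rng.random(n) < 0.5
    return np.where(from_second, rng.normal(3, 1, n), rng.normal(-2, 1, n))

theta = 0.5 * (-2) + 0.5 * 3          # population mean of the mixture
trials = 5000                         # number of repeated datasets

spread = {}
for N in (10, 50, 500):
    # One sample average per dataset of size N
    averages = np.array([draw_mixture(N, rng).mean() for _ in range(trials)])
    spread[N] = averages.std()
    print(f"N = {N:3d}: mean of averages = {averages.mean():.3f}, "
          f"std of averages = {spread[N]:.3f}")
```

The mean of the sample averages stays near the population mean `theta`, while their standard deviation shrinks as the number of samples per dataset grows.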


9.1.2 Understanding confidence intervals 


The confidence interval is a probabilistic statement about $\widehat{\Theta}$. Instead of studying $\widehat{\Theta}$ as a
point, we construct an interval

$$\mathcal{I} = \left[\widehat{\Theta} - \epsilon,\; \widehat{\Theta} + \epsilon\right], \qquad (9.2)$$

for some $\epsilon$ to be determined. Note that this interval is a random interval: If we have a
different realization of $\widehat{\Theta}$, we will have a different $\mathcal{I}$. We call $\mathcal{I}$ the confidence interval for
the estimator $\widehat{\Theta}$.

Given this random interval, we ask: What is the probability that $\mathcal{I}$ includes $\theta$? That
means that we want to evaluate the probability

$$\mathbb{P}[\theta \in \mathcal{I}] = \mathbb{P}\left[\widehat{\Theta} - \epsilon \le \theta \le \widehat{\Theta} + \epsilon\right].$$

We emphasize that the randomness in this probability is caused by $\widehat{\Theta}$, not $\theta$. This is because
the interval $\mathcal{I}$ changes when we conduct a different experiment to obtain a different $\widehat{\Theta}$. The
situation is similar to that illustrated on the left-hand side of Figure 9.2. The confidence
interval $\mathcal{I}$ changes but the true parameter $\theta$ is fixed.


Figure 9.2: Confidence interval is the random interval $\mathcal{I} = [\widehat{\Theta} - \epsilon, \widehat{\Theta} + \epsilon]$, not the deterministic interval
$[\theta - \epsilon, \theta + \epsilon]$. The random interval in the former case does not require any knowledge about the true
parameter $\theta$, whereas the latter requires $\theta$. By claiming a 95% confidence interval, we say that there
is a 95% chance that the random interval will include the true parameter. So if you have 100 random
realizations of the confidence intervals, then 95 on average will include the true parameter.


Confidence intervals can be confusing. Often the confusion arises because of the following
identity:

$$\mathbb{P}\left[\widehat{\Theta} - \epsilon \le \theta \le \widehat{\Theta} + \epsilon\right] = \mathbb{P}\left[-\epsilon \le \theta - \widehat{\Theta} \le \epsilon\right] = \mathbb{P}\left[\theta - \epsilon \le \widehat{\Theta} \le \theta + \epsilon\right]. \qquad (9.3)$$

Although the values of the two probabilities are the same, the two events are interpreted
differently. The right-hand side of Figure 9.2 illustrates $\mathbb{P}[\theta - \epsilon \le \widehat{\Theta} \le \theta + \epsilon]$. The interval
$[\theta - \epsilon, \theta + \epsilon]$ is fixed. What is the probability that the estimator $\widehat{\Theta}$ lies within this deterministic
interval? To find this probability, we need to know the true parameter $\theta$, which is not
available. By contrast, the other probability $\mathbb{P}[\widehat{\Theta} - \epsilon \le \theta \le \widehat{\Theta} + \epsilon]$ does not require any
knowledge about the true parameter $\theta$. What is the probability that the true parameter is
included inside the random interval? If the probability is high, we say that there is a good
chance that our confidence interval will contain the true parameter. This is observed in the
left-hand side of Figure 9.2.

In practice we often set $\mathbb{P}[\widehat{\Theta} - \epsilon \le \theta \le \widehat{\Theta} + \epsilon]$ to be greater than a certain confidence
level, say 95%, and then we determine $\epsilon$. Once we have determined $\epsilon$, we can claim that
with 95% probability the interval $[\widehat{\Theta} - \epsilon, \widehat{\Theta} + \epsilon]$ will include the unknown parameter $\theta$. We
do not need to know $\theta$ at any point in this process.

To make this more general, we define $1 - \alpha$ as the confidence level for some parameter
$\alpha$. For example, if we would like to have a 95% confidence level, we set $\alpha = 0.05$. Then
the probability inequality

$$\mathbb{P}\left[\widehat{\Theta} - \epsilon \le \theta \le \widehat{\Theta} + \epsilon\right] \ge 1 - \alpha \qquad (9.4)$$

tells us that there is at least a 95% chance that the random interval $\mathcal{I} = [\widehat{\Theta} - \epsilon, \widehat{\Theta} + \epsilon]$ will
include the true parameter $\theta$. In this case we say that $\mathcal{I}$ is a “95% confidence interval”.


What is a 95% confidence interval? 


• It is a random interval $[\widehat{\Theta} - \epsilon, \widehat{\Theta} + \epsilon]$ such that there is 95% probability for it to
include the true parameter $\theta$.


• It is not the deterministic interval $[\theta - \epsilon, \theta + \epsilon]$, because we never know $\theta$.


Example 9.2. After analyzing the life expectancy of people in the United States, it 
was concluded that the 95% confidence interval is (77.8, 79.1) years old. Is the following 
claim valid? 


About 95% of the people in the United States have a life expectancy between 77.8 
years old and 79.1 years old. 


Solution. No. The confidence interval tells us that with 95% probability the random 
interval (77.8, 79.1) will include the true average. We emphasize that (77.8, 79.1) is 
random because it is constructed from a small set of data points. If we survey another 
set of people we will have another interval. 

Since we do not know the true average, we do not know the percentage of people 
whose life expectancy is between 77.8 years old and 79.1 years old. It could be that the 
true average is 80 years old, which is out of the range. It could also be that the true 
average is 77.9 years old, which is within the range, but only 10% of the population 
may have life expectancy in (77.8, 79.1). 


Example 9.3. After studying the SAT scores of 1000 high school students, it was 
concluded that the 95% confidence interval is (1134, 1250) points. Is the following 
claim valid? 


There is a 95% probability that the average SAT score in the population is in the 
range 1134 and 1250. 


Solution. Yes, but it can be made clearer. The average SAT score in the population 
remains unknown. It is a constant and it is deterministic, so there is no probability 
associated with it. A better way to say this is: “There is 95% probability that the 
random interval 1134 and 1250 will include the average SAT score.” We emphasize 
that the 95% probability is about the random interval, not the unknown parameter. 
9.1.3 Constructing a confidence interval


Let’s consider an example. Suppose that we have a set of i.i.d. observations $X_1,\ldots,X_N$
that are Gaussians with an unknown mean $\theta$ and a known variance $\sigma^2$. We consider the
maximum-likelihood estimator, which is the sample average:

$$\widehat{\Theta} = \frac{1}{N}\sum_{n=1}^{N} X_n.$$

Our goal is to construct a confidence interval.


Figure 9.3: Conceptual illustration of how to construct a confidence interval. Starting with the
population, we draw random subsets. Each random subset gives us an estimator, and correspondingly an
interval. [Panels: population distribution; empirical histograms of the random subsets; histogram of
the sample average.]


Before we consider the equations, let’s look at a graph illustrating what we want to
achieve. Figure 9.3 shows a population distribution, which is a Gaussian in this example.
We draw $N$ samples from the Gaussian to construct a random subset. Based on this random
subset we construct the estimator $\widehat{\Theta}$. Since this estimator is based on the particular random
subset we have, we can follow the same approach by drawing another random subset. To
differentiate the estimators constructed by the different random subsets, let’s call the estimators
$\widehat{\Theta}^{(1)}$ and $\widehat{\Theta}^{(2)}$, respectively. For each estimator we construct an interval $[\widehat{\Theta} - \epsilon, \widehat{\Theta} + \epsilon]$
to obtain two different intervals:

$$\mathcal{I}_1 = \left[\widehat{\Theta}^{(1)} - \epsilon,\; \widehat{\Theta}^{(1)} + \epsilon\right] \quad\text{and}\quad \mathcal{I}_2 = \left[\widehat{\Theta}^{(2)} - \epsilon,\; \widehat{\Theta}^{(2)} + \epsilon\right].$$

If we can determine $\epsilon$, we have found the confidence interval.

We can determine the confidence interval by observing the histogram of $\widehat{\Theta}$, which in
our case is the histogram of the sample average. This histogram is well defined: it is a
Gaussian because the average of $N$ i.i.d. Gaussian random variables is Gaussian. Therefore,
the width of the confidence interval is determined by the answer to this question:


For what $\epsilon$ can we cover 95% of the histogram of $\widehat{\Theta}$?


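To find the answer, we set up the following probability inequality:

$$\mathbb{P}\left[\frac{|\widehat{\Theta} - \mathbb{E}[\widehat{\Theta}]|}{\sqrt{\text{Var}[\widehat{\Theta}]}} \le \epsilon\right] \ge 1 - \alpha.$$

This probability says that we want to find an $\epsilon$ such that the majority of $\widehat{\Theta}$ is living close
to its mean. The level $1 - \alpha$ is our confidence level, which is typically 95%. Equivalently, we
let $\alpha = 0.05$.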

In the above equation, we can define the quotient as

$$Z = \frac{\widehat{\Theta} - \mathbb{E}[\widehat{\Theta}]}{\sqrt{\text{Var}[\widehat{\Theta}]}}.$$

We know that $Z$ is a zero-mean unit-variance Gaussian because it is the standardized variable.
[Note: Not all normalized variables are Gaussian, but if $\widehat{\Theta}$ is a Gaussian the normalized
variable will remain a Gaussian.] Thus, the probability inequality we are looking at is

$$\mathbb{P}\big[\underbrace{|Z| \le \epsilon}_{\text{two tails of a standard Gaussian}}\big] \ge 1 - \alpha.$$


The PDF of $Z$ is shown in Figure 9.4. As you can see, to achieve 95% confidence we need
to pick an appropriate $\epsilon$ such that the shaded area is less than 5%.


Figure 9.4: PDF of the random variable $Z = (\widehat{\Theta} - \mathbb{E}[\widehat{\Theta}])/\sqrt{\text{Var}[\widehat{\Theta}]}$. The shaded area denotes the
$\alpha = 0.05$ confidence level.


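Since $\mathbb{P}[Z \le \epsilon]$ is the CDF of a Gaussian, which we denote by $\Phi(\epsilon)$, it follows that

$$\mathbb{P}[|Z| \le \epsilon] = \mathbb{P}[-\epsilon \le Z \le \epsilon] = \Phi(\epsilon) - \Phi(-\epsilon).$$

Using the symmetry of the Gaussian, it follows that $\Phi(-\epsilon) = 1 - \Phi(\epsilon)$ and hence

$$\mathbb{P}[|Z| \le \epsilon] = 2\Phi(\epsilon) - 1.$$

Equating this result with the probability inequality $\mathbb{P}[|Z| \le \epsilon] \ge 1 - \alpha$, we have that

$$\epsilon \ge \Phi^{-1}\left(1 - \frac{\alpha}{2}\right).$$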

The remainder of this problem is solvable on a computer. In MATLAB, we can call
icdf to compute the inverse CDF of a standard Gaussian. In Python, the command is
stats.norm.ppf. The commands are as shown below.


% MATLAB code to compute the width of the confidence interval
alpha = 0.05;
mu = 0; sigma = 1; % Standard Gaussian
epsilon = icdf('norm',1-alpha/2,mu,sigma)


# Python code to compute the width of the confidence interval
import scipy.stats as stats
alpha = 0.05
mu = 0; sigma = 1  # Standard Gaussian
epsilon = stats.norm.ppf(1-alpha/2, mu, sigma)
print(epsilon)


If everything is done properly, we see that for a 95% confidence level ($\alpha = 0.05$) the
corresponding $\epsilon$ is $\epsilon = 1.96$.

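After determining $\epsilon$, it remains to determine $\mathbb{E}[\widehat{\Theta}]$ and $\text{Var}[\widehat{\Theta}]$ in order to complete the
probability inequality. To this end, we note that

$$\mathbb{E}[\widehat{\Theta}] = \mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N} X_n\right] = \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}[X_n] = \theta,$$

$$\text{Var}[\widehat{\Theta}] = \text{Var}\left[\frac{1}{N}\sum_{n=1}^{N} X_n\right] = \frac{1}{N^2}\sum_{n=1}^{N}\text{Var}[X_n] = \frac{\sigma^2}{N},$$

if we assume that the population distribution is Gaussian($\theta$, $\sigma^2$), where $\theta$ is unknown but $\sigma$
is known. Substituting these into the probability inequality, we have that

$$\mathbb{P}\left[\frac{|\widehat{\Theta} - \mathbb{E}[\widehat{\Theta}]|}{\sqrt{\text{Var}[\widehat{\Theta}]}} \le \epsilon\right] = \mathbb{P}\left[\frac{|\widehat{\Theta} - \theta|}{\sigma/\sqrt{N}} \le 1.96\right] = \mathbb{P}\left[\widehat{\Theta} - 1.96\frac{\sigma}{\sqrt{N}} \le \theta \le \widehat{\Theta} + 1.96\frac{\sigma}{\sqrt{N}}\right],$$

where we let $\epsilon = 1.96$ for a 95% confidence level. Therefore, the 95% confidence interval is

$$\left[\widehat{\Theta} - 1.96\frac{\sigma}{\sqrt{N}},\; \widehat{\Theta} + 1.96\frac{\sigma}{\sqrt{N}}\right]. \qquad (9.5)$$

As you can see, we do not need to know the value of $\theta$ at any point of the derivation because
the confidence interval in Equation (9.5) does not involve $\theta$. This is an important difference
with the other probability $\mathbb{P}[\theta - \epsilon \le \widehat{\Theta} \le \theta + \epsilon]$, which requires $\theta$.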




How to construct a confidence interval 


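• Compute the estimator $\widehat{\Theta}$.


• Determine the width of the confidence interval $\epsilon$ by inspecting the confidence
level $1 - \alpha$. If $\widehat{\Theta}$ is Gaussian, then $\epsilon = \Phi^{-1}(1 - \frac{\alpha}{2})$, measured in units of the
standard deviation $\sqrt{\text{Var}[\widehat{\Theta}]}$.


• If $\widehat{\Theta}$ is not a Gaussian, replace the Gaussian CDF by the CDF of $\widehat{\Theta}$.


• The confidence interval is $[\widehat{\Theta} - \epsilon\sqrt{\text{Var}[\widehat{\Theta}]},\; \widehat{\Theta} + \epsilon\sqrt{\text{Var}[\widehat{\Theta}]}]$. For the sample average
with known $\sigma$, this is the interval in Equation (9.5).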
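
The interpretation of a 95% confidence interval can be checked by simulation. The following is a sketch under assumed values, not a definitive experiment: the true mean `theta`, the known `sigma`, the sample size `N`, the number of trials, and the seed are all illustrative choices. The true parameter is used only to score the intervals, never to build them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)      # arbitrary seed (assumption)
theta, sigma, N = 1.0, 2.0, 25      # illustrative values; theta is unknown in practice
alpha = 0.05
eps = stats.norm.ppf(1 - alpha / 2)          # critical value, about 1.96
half_width = eps * sigma / np.sqrt(N)        # 1.96 * sigma / sqrt(N)

trials = 10000
X = rng.normal(theta, sigma, size=(trials, N))   # one dataset per row
theta_hat = X.mean(axis=1)                       # one estimate per dataset

# Each interval is built from theta_hat alone; theta is used only for scoring
covered = (theta_hat - half_width <= theta) & (theta <= theta_hat + half_width)
coverage = covered.mean()
print("fraction of intervals containing theta:", coverage)
```

The printed fraction should be close to 0.95, which is the sense in which the random interval "includes the true parameter with 95% probability".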


9.1.4 Properties of the confidence interval 


Some important properties of the confidence interval are listed below. 


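• Probability of $\widehat{\Theta}$ is the same as probability of $Z$. First, the two random variables $\widehat{\Theta}$
and $Z$ have a one-to-one correspondence. We proved the following in Chapter 6:

$$\text{If } \widehat{\Theta} \sim \text{Gaussian}\left(\theta, \frac{\sigma^2}{N}\right), \text{ then } Z = \frac{\widehat{\Theta} - \theta}{\sigma/\sqrt{N}} \sim \text{Gaussian}(0, 1).$$

For example, if $\widehat{\Theta} \sim \text{Gaussian}(\theta, \frac{\sigma^2}{N})$ with $N = 1$, $\theta = 1$ and $\sigma = 2$, then a 95%
confidence level is

$$0.95 \approx \mathbb{P}[-1.96 \le Z \le 1.96] \qquad (Z \text{ is within 1.96 std from } Z\text{’s mean})$$

$$\phantom{0.95} = \mathbb{P}[-2.92 \le \widehat{\Theta} \le 4.92]. \qquad (\widehat{\Theta} \text{ is within 1.96 std from } \widehat{\Theta}\text{’s mean})$$

Note that while the range for $Z$ is different from the range for $\widehat{\Theta}$, they both return the
same probability. The only difference is that $\widehat{\Theta}$ is constructed before the normalization
and $Z$ is constructed after the normalization.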

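• Standard error. In this estimation problem we know that $\widehat{\Theta}$ is the sample average. We
assume that the mean $\theta$ is unknown but the variance $\text{Var}[\widehat{\Theta}]$ is known. The standard
deviation of $\widehat{\Theta}$ is called the standard error:

$$\text{se} = \sqrt{\text{Var}[\widehat{\Theta}]} = \frac{\sigma}{\sqrt{N}}. \qquad (9.7)$$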


• Critical value. The value 1.96 in our example is often known as the critical value. It
is defined as

$$z_\alpha = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right). \qquad (9.8)$$

The $z_\alpha$ value gives us a multiplier applied to the standard error that will result in a
value within the confidence interval. This is because, by the definition of the confidence
interval, the interval is

$$\left[\widehat{\Theta} - 1.96\frac{\sigma}{\sqrt{N}},\; \widehat{\Theta} + 1.96\frac{\sigma}{\sqrt{N}}\right] = \left[\widehat{\Theta} - z_\alpha\,\text{se},\; \widehat{\Theta} + z_\alpha\,\text{se}\right].$$

• Margin of error. The margin of error is defined as

$$\text{margin of error} = z_\alpha\frac{\sigma}{\sqrt{N}}. \qquad (9.9)$$

The margin of error is also the width of the confidence interval. As the name implies,
the margin of error tells us how much error the confidence interval includes when
predicting the population parameter.


Practice Exercise 9.1. Suppose that the number of photos a Facebook user uploads 
per day is a random variable with o = 2. In a set of 341 users, the sample average is 
2.9. Find the 90% confidence interval of the population mean. 


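Solution. We set $\alpha = 0.1$. The $z_\alpha$-value is

$$z_\alpha = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) = 1.6449.$$

The 90% confidence interval is then

$$\left[2.9 - 1.64\frac{2}{\sqrt{341}},\; 2.9 + 1.64\frac{2}{\sqrt{341}}\right] = [2.72, 3.08].$$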


Therefore, with 90% probability, the interval [2.72, 3.08] includes the population mean. 
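
The arithmetic in this solution can be reproduced in a few lines. This is a minimal sketch that uses only the numbers stated in the exercise; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

sigma, N, theta_hat = 2.0, 341, 2.9   # values stated in Practice Exercise 9.1
alpha = 0.10                          # 90% confidence level

z = stats.norm.ppf(1 - alpha / 2)     # critical value
se = sigma / np.sqrt(N)               # standard error
lower, upper = theta_hat - z * se, theta_hat + z * se
print(f"z = {z:.4f}, interval = [{lower:.2f}, {upper:.2f}]")
```

Rounded to two decimals, the endpoints agree with the interval [2.72, 3.08] obtained in the solution.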


Example 9.4. Professional cyber-athletes have a standard deviation of $\sigma = 73.4$
actions per minute. If we want to estimate the average actions per minute of the
population, how many samples are needed to obtain a margin of error $\le 20$ at 90%
confidence?


Solution. With a 90% confidence level, the $z_\alpha$-value is

$$z_\alpha = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) = \Phi^{-1}(0.95) = 1.645.$$

The margin of error is at most 20, so we have $z_\alpha\frac{\sigma}{\sqrt{N}} \le 20$. Moving around the terms gives us

$$N \ge \left(\frac{z_\alpha\,\sigma}{20}\right)^2 = 36.45.$$

Therefore, we need at least $N = 37$ samples to ensure a margin of error of $\le 20$ at a
90% confidence level.
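
The sample-size calculation can also be done numerically. The sketch below uses only the numbers stated in Example 9.4; the variable names `target` and `N_min` are illustrative.

```python
import math
from scipy import stats

sigma = 73.4        # standard deviation stated in Example 9.4
target = 20.0       # required margin of error
alpha = 0.10        # 90% confidence level

z = stats.norm.ppf(1 - alpha / 2)
# z * sigma / sqrt(N) <= target  is equivalent to  N >= (z * sigma / target)**2
N_min = math.ceil((z * sigma / target) ** 2)
print("smallest N:", N_min)
```

Rounding up is necessary because $N$ must be an integer and the margin of error decreases as $N$ grows.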
Figure 9.5: Relationships between the standard error se, the $z_\alpha$ value, and the margin of error. The
confidence level $\alpha$ is the area under the curve for the tails of each PDF.


The concepts of standard error se, the $z_\alpha$ value, and the margin of error are summarized
in Figure 9.5. The left-hand side is the PDF of $Z$. It is the normalized random variable,
which is also the standard Gaussian. The right-hand side is the PDF of $\widehat{\Theta}$, the unnormalized
random variable. The $z_\alpha$ value is located in the $Z$-space. It defines the range of $Z$ in the
PDF within which we are confident about the true parameter. The corresponding value
in the $\widehat{\Theta}$-space is the margin of error. This is found by multiplying $z_\alpha$ with the standard
deviation of $\widehat{\Theta}$, known as the standard error. Correspondingly, in the $Z$-space the standard
deviation is unity.

Two further points about the confidence interval should be mentioned: 


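• Number of Samples $N$. The confidence interval is a function of $N$. As we increase the
number of samples, the distribution of the estimator $\widehat{\Theta}$ becomes narrower. Specifically,
if $\widehat{\Theta}$ follows a Gaussian distribution

$$\widehat{\Theta} \sim \text{Gaussian}\left(\theta, \frac{\sigma^2}{N}\right),$$

then $\widehat{\Theta} \xrightarrow{p} \theta$ as $N \to \infty$. Figure 9.6 illustrates a few examples of $\widehat{\Theta}$ as $N$ grows. In the
limit when $N \to \infty$, we observe that the interval becomes

$$\left[\widehat{\Theta} - 1.96\frac{\sigma}{\sqrt{N}},\; \widehat{\Theta} + 1.96\frac{\sigma}{\sqrt{N}}\right] \longrightarrow \left[\widehat{\Theta},\; \widehat{\Theta}\right] = \widehat{\Theta}.$$

In this case, the statement $\theta \in \left[\widehat{\Theta} - 1.96\frac{\sigma}{\sqrt{N}},\; \widehat{\Theta} + 1.96\frac{\sigma}{\sqrt{N}}\right]$ becomes $\theta = \widehat{\Theta}$. That
means the estimator $\widehat{\Theta}$ returns the correct true parameter $\theta$. Of course, it is possible
that $\mathbb{E}[\widehat{\Theta}] \neq \theta$, i.e., the estimator is biased. In that case, as we collect more samples the
estimator will approach a value that is not $\theta$.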


• Distribution of $Z$. When defining the confidence interval we constructed an intermediate
variable

$$Z = \frac{\widehat{\Theta} - \theta}{\sigma/\sqrt{N}}.$$

Since $X_n$’s are i.i.d. Gaussian, it follows that $Z$ is also Gaussian. This gives us a way
to calculate the probability using the standard Gaussian table. What happens when
$X_n$’s are not Gaussian? The good news is that even if $X_n$’s are not Gaussian, for
sufficiently large $N$, the random variable $\widehat{\Theta}$ is more or less Gaussian, because of the
Central Limit Theorem. Therefore, even if $X_n$’s are not Gaussian we can still use the
Gaussian probability table to construct $\alpha$ and $\epsilon$.


Figure 9.6: The PDF of $\widehat{\Theta}$ as the number of samples $N$ grows. Here, we assume that $X_n$ are i.i.d.
Gaussian random variables with mean $\theta = 0$ and variance $\sigma^2 = 1$.
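
The claim that the Gaussian critical value remains usable for non-Gaussian samples can be probed empirically. The sketch below is an illustration under assumptions that are not in the text: the samples are exponential with mean 1 (so the standard deviation is also 1 and is treated as known), and the sample sizes, number of trials, and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)          # arbitrary seed (assumption)
theta = sigma = 1.0                     # an exponential with mean 1 has std 1
z = stats.norm.ppf(0.975)               # 95% critical value from the Gaussian
trials = 20000

coverage = {}
for N in (5, 30, 200):
    X = rng.exponential(theta, size=(trials, N))   # non-Gaussian samples
    theta_hat = X.mean(axis=1)
    half_width = z * sigma / np.sqrt(N)
    coverage[N] = np.mean(np.abs(theta_hat - theta) <= half_width)
    print(f"N = {N:3d}: empirical coverage = {coverage[N]:.3f}")
```

Comparing the printed values with the nominal 95% level gives a rough sense of how well the Gaussian approximation works for a given $N$ and a given population distribution; other distributions may behave differently.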


9.1.5 Student’s t-distribution


In the discussions above, we estimate the population mean @ using the estimator ©. The 
assumption was that the variance o? was known a priori and hence is fixed. In practice, 
however, there are many situations where o? is not known. Thus we not only need to use 
the mean estimator © but also the variance estimator S, which can be defined as 

ie it 
ss ——_) (xX, -— 6)’, 

Woy kn - 8) 
n=1 
where © is the estimator of the mean. What is the confidence interval for 6? 
For a confidence interval to be valid, we expect it to take the form of

$$\mathcal{I} = \left[\widehat{\Theta} - z_\alpha \frac{\widehat{S}}{\sqrt{N}}, \;\; \widehat{\Theta} + z_\alpha \frac{\widehat{S}}{\sqrt{N}}\right],$$

which is essentially the confidence interval we have just derived but with $\sigma$ replaced by $\widehat{S}$.
However, there is a problem with this. When we derive the confidence interval assuming a
known $\sigma$, the $z_\alpha$ value is determined by checking the standard Gaussian

$$Z = \frac{\widehat{\Theta} - \theta}{\sigma/\sqrt{N}},$$

which gives us $z_\alpha = \Phi^{-1}(1 - \alpha/2)$. The whole derivation is based on the fact that Z is a
standard Gaussian. Now that we have replaced $\sigma$ by $\widehat{S}$, the new random variable

$$T = \frac{\widehat{\Theta} - \theta}{\widehat{S}/\sqrt{N}} \qquad (9.10)$$

is not a standard Gaussian.
It turns out that the distribution of T is Student’s t-distribution with N − 1 degrees
of freedom. The PDF of Student’s t-distribution is given as follows.


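Definition 9.1. If X is a random variable following Student’s t-distribution of $\nu$
degrees of freedom, then the PDF of X is

$$f_X(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}. \qquad (9.11)$$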


We may compare Student’s t-distribution with the Gaussian distribution. Figure 9.7 shows
the standard Gaussian and several t-distributions with $\nu = N - 1$ degrees of freedom. Note
that Student’s t-distribution has a similar shape to the Gaussian but it has a heavier tail.


Figure 9.7: The PDF of Student’s t-distribution with $\nu = N - 1$ degrees of freedom. The plotted
curves are Gaussian(0,1) and the t-distributions for N = 11, N = 3, and N = 2.
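The comparison in Figure 9.7 can be reproduced numerically. The following is a small sketch, assuming the standard form of the Student’s t PDF in Definition 9.1; the helper names `t_pdf` and `gaussian_pdf` are illustrative and not from the text. It evaluates both densities at the center and in the tail for the degrees of freedom used in the figure ($\nu = 1, 2, 10$) plus one larger value.

```python
import math

def t_pdf(x, nu):
    """PDF of Student's t-distribution with nu degrees of freedom."""
    # Work in log space so that the Gamma functions do not overflow for large nu.
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))

def gaussian_pdf(x):
    """PDF of the standard Gaussian."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for nu in (1, 2, 10, 100):
    print(f"nu = {nu:3d}: f(0) = {t_pdf(0.0, nu):.4f}, f(4) = {t_pdf(4.0, nu):.6f}")
print(f"Gaussian: f(0) = {gaussian_pdf(0.0):.4f}, f(4) = {gaussian_pdf(4.0):.6f}")
```

For small $\nu$ the density at x = 4 is much larger than the Gaussian value, which is the heavier tail; as $\nu$ grows the two densities become close.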


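Since $T = \frac{\widehat{\Theta} - \theta}{\widehat{S}/\sqrt{N}}$ is a t-random variable, to determine the $z_\alpha$ value we can follow the
same procedure by considering the CDF of T. Let the CDF of the Student’s t-distribution
with $\nu$ degrees of freedom be

$$\Psi_\nu(z) = \text{CDF of } X \text{ at } z.$$

If we want $\mathbb{P}[|T| \le z_\alpha] = 1 - \alpha$, it follows that

$$z_\alpha = \Psi_\nu^{-1}\left(1 - \frac{\alpha}{2}\right). \qquad (9.12)$$

Therefore, the new confidence interval, assuming an unknown $\sigma$, is

$$\mathcal{I} = \left[\widehat{\Theta} - z_\alpha \frac{\widehat{S}}{\sqrt{N}}, \;\; \widehat{\Theta} + z_\alpha \frac{\widehat{S}}{\sqrt{N}}\right],$$

with $z_\alpha$ defined in Equation (9.12), using $\nu = N - 1$.

Practice Exercise 9.2. A survey asked N = 14 people for their rating of a movie. Assume
that the mean estimator is $\widehat{\Theta}$ and the variance estimator is $\widehat{S}$. Find the confidence
interval.


Solution. If we use Student’s t-distribution with $\alpha = 0.05$, it follows that

$$z_\alpha = \Psi_{13}^{-1}\left(1 - \frac{\alpha}{2}\right) = 2.16,$$

where the degrees of freedom are $\nu = 14 - 1 = 13$. Thus the confidence interval is

$$\mathcal{I} = \left[\widehat{\Theta} - 2.16\frac{\widehat{S}}{\sqrt{14}}, \;\; \widehat{\Theta} + 2.16\frac{\widehat{S}}{\sqrt{14}}\right].$$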


The MATLAB and Python codes to report the $z_\alpha$ value of a Student’s t-distribution
are shown below. They are both called through the inverse CDF function. In MATLAB it
is icdf, and in Python it is stats.t.ppf.


% MATLAB code to compute the z_alpha value of t distribution
alpha = 0.05;
nu = 13;
z = icdf('T', 1-alpha/2, nu)


# Python code to compute the z_alpha value of t distribution
import scipy.stats as stats
alpha = 0.05
nu = 13
z = stats.t.ppf(1-alpha/2, nu)
print(z)


Example 9.5. A class of 10 students took a midterm exam. Their scores are given in
the following table.


Student   1   2   3   4   5   6   7   8   9  10
Score    72  69  75  58  67  70  60  71  59  65


Find the 95% confidence interval.


Solution. The mean and standard deviation of the dataset are respectively $\widehat{\Theta} = 66.6$
and $\widehat{S} = 5.61$. The critical $z_\alpha$ value is determined by Student’s t-distribution:

$$z_\alpha = \Psi_{9}^{-1}\left(1 - \frac{\alpha}{2}\right) = 2.26.$$

The confidence interval is

$$\left[\widehat{\Theta} - z_\alpha \frac{\widehat{S}}{\sqrt{N}}, \;\; \widehat{\Theta} + z_\alpha \frac{\widehat{S}}{\sqrt{N}}\right] = [62.59, \; 70.61].$$

Therefore, with 95% probability, the interval [62.59, 70.61] will include the true
population mean.


Remark 1. Make sure you understand the meaning of “population mean” in this
example. Since we have ten students, isn’t the population mean just the average of the
ten scores? This is incorrect. In statistics, we assume that these ten students are the
realizations of some underlying (unknown) random variable X with some PDF $f_X(x)$.
The population mean $\theta$ is therefore the expectation $\mathbb{E}[X]$, where the expectation is
taken w.r.t. $f_X$. The sample average $\widehat{\Theta}$, which is the average of the ten numbers, is an
estimator of the population mean $\theta$.


Remark 2. You may be wondering why we are using Student’s t-distribution here
when we do not even know the PDF of X. The answer is that it is an approximation.
When X is Gaussian, the standardized sample average $T = (\widehat{\Theta} - \theta)/(\widehat{S}/\sqrt{N})$ follows a
Student’s t-distribution, where the unknown variance is approximated by the sample
variance $\widehat{S}^2$. This result is attributed to the original paper of William Gosset, who
developed Student’s t-distribution.


The above example can be solved computationally. An implementation through Python 
is given below, and the MATLAB implementation is straightforward. 


# Python code to generate a confidence interval
import numpy as np
import scipy.stats as stats

x = np.array([72, 69, 75, 58, 67, 70, 60, 71, 59, 65])
N = x.size
Theta_hat = np.mean(x)          # Sample mean
S_hat = np.std(x)               # Sample standard deviation; NumPy's default (ddof=0)
                                # gives the 5.61 quoted above, while np.std(x, ddof=1)
                                # uses the 1/(N-1) normalization
nu = x.size-1                   # Degrees of freedom
alpha = 0.05                    # Significance level (95% confidence)
z = stats.t.ppf(1-alpha/2, nu)  # Critical value of the t-distribution
CI_L = Theta_hat-z*S_hat/np.sqrt(N)
CI_U = Theta_hat+z*S_hat/np.sqrt(N)
print(CI_L, CI_U)


What is Student’s t-distribution?


• It was developed by William Gosset in 1908. When he published the paper he
used the pseudonym Student.


• We use Student’s t-distribution to model the estimator $\widehat{\Theta}$’s PDF when the
variance $\sigma^2$ is replaced by the sample variance $\widehat{S}^2$.


• Student’s t-distribution has a heavier tail than a Gaussian.




CHAPTER 9. CONFIDENCE AND HYPOTHESIS 


9.1.6 Comparing Student’s t-distribution and Gaussian 


We now discuss an important theoretical result regarding the relationship between a
Student’s t-distribution and Gaussian distribution. The main result is that the standard
Gaussian is a limiting distribution of the t-distribution as the degrees of freedom $\nu \to \infty$.


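Theorem 9.1. As $\nu \to \infty$, the Student’s t-distribution approaches the standard
Gaussian distribution:

$$\lim_{\nu \to \infty} \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} = \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}}. \qquad (9.13)$$


The proof of the theorem requires Stirling’s approximation, which is not essential for this
book. Feel free to skip it if needed.


Proof. There are two results we need to use:
• Stirling’s approximation:² $\Gamma(z) \approx \sqrt{\frac{2\pi}{z}} \left(\frac{z}{e}\right)^{z}$.
• Exponential approximation: $\left(1 + \frac{x}{k}\right)^{-k} \to e^{-x}$, as $k \to \infty$.


We have that

$$\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \approx \frac{1}{\sqrt{\nu\pi}} \cdot \frac{\sqrt{\frac{4\pi}{\nu+1}} \left(\frac{\nu+1}{2e}\right)^{\frac{\nu+1}{2}}}{\sqrt{\frac{4\pi}{\nu}} \left(\frac{\nu}{2e}\right)^{\frac{\nu}{2}}} = \frac{1}{\sqrt{2\pi e}} \left(1 + \frac{1}{\nu}\right)^{\frac{\nu}{2}}.$$

Putting a limit of $\nu \to \infty$, we have that

$$\lim_{\nu \to \infty} \frac{1}{\sqrt{2\pi e}} \left(1 + \frac{1}{\nu}\right)^{\frac{\nu}{2}} = \frac{1}{\sqrt{2\pi e}}\, e^{\frac{1}{2}} = \frac{1}{\sqrt{2\pi}}.$$

The other limit follows from the fact that

$$\lim_{\nu \to \infty} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} = e^{-\frac{t^2}{2}}.$$

Combining the two limits proves the theorem.


² K. G. Binmore, Mathematical analysis: A straightforward approach. Cambridge University Press, 1977.
Section 17.7.2.

End of the proof. Please join us again.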


This theorem has several implications:

• When N is large, $\widehat{S}^2 \to \sigma^2$. The Gaussian approximation kicks in, and so Student’s
t-distribution is more or less the same as the Gaussian.


• Student’s t-distribution is better for small N, usually N < 30. If N ≥ 30, using the
Gaussian approximation suffices.


• If X is Gaussian, Student’s t-distribution is an excellent model. If X is not Gaussian,
Student’s t-distribution will have some issues unless N increases.
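The second implication can be checked by simulation. The sketch below is an illustration under stated assumptions, not part of the original text: the data are Gaussian with hypothetical true mean 0 and standard deviation 1, N = 10, and the two critical values are hard-coded (1.960 for the Gaussian, 2.262 for the t-distribution with $\nu = 9$, consistent with the 2.26 of Example 9.5). It counts how often each interval covers the true mean when $\sigma$ is replaced by $\widehat{S}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 10, 20000
theta, sigma = 0.0, 1.0             # hypothetical true mean and standard deviation
z_gauss = 1.960                     # Gaussian critical value for 95% confidence
z_t = 2.262                         # t critical value for nu = 9 and 95% confidence

X = rng.normal(theta, sigma, size=(trials, N))
Theta_hat = X.mean(axis=1)          # one sample average per trial
S_hat = X.std(axis=1, ddof=1)       # sample standard deviation with 1/(N-1)
half = S_hat / np.sqrt(N)

# Fraction of trials in which the interval contains the true mean
cover_gauss = np.mean(np.abs(Theta_hat - theta) <= z_gauss * half)
cover_t = np.mean(np.abs(Theta_hat - theta) <= z_t * half)
print(cover_gauss, cover_t)
```

With only ten samples, the interval built from the Gaussian critical value covers the true mean less often than the nominal 95%, whereas the t-based interval stays close to 95%.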


9.2 Bootstrapping 


When estimating the confidence interval, we focus exclusively on the sample average
$\widehat{\Theta} = (1/N)\sum_{n=1}^{N} X_n$. There are, however, many estimators that are not sample averages. For
example, we might be interested in an estimator that reports the sample median:
$\widehat{\Theta} = \text{median}\{X_1, \ldots, X_N\}$. In such cases, the Gaussian-based analysis or the Student’s t-based
analysis we just derived would not work.

Stepping back a little further, it is important to understand the hierarchy of estimation. 
Figure 9.8 illustrates a rough breakdown of the various techniques. On the left-hand side 
of the tree, we have three point estimation methods: MLE, MAP, and MMSE. They are 
so-called point estimation methods because they are reporting a point — a single value. 
This stands in contrast to the right-hand side of the tree, in which we report the confidence 
interval. Note that point estimates and confidence intervals do not conflict with each other. 
The point estimates are used for the actual engineering solution and the confidence intervals 
are used to report the confidence about the point estimates. Under the branch of confidence
intervals we discussed the sample average. However, if we want to study an estimator that is
not the sample average, we need the technique known as bootstrapping, a method
for estimating the confidence interval. Notably, it does not give you a better point estimate.

As we have frequently emphasized, since $\widehat{\Theta}$ is a random variable, it has its own PDF,
CDF, mean, variance, etc. The confidence interval introduced in the previous section
provides one way to quantify the randomness of $\widehat{\Theta}$. Throughout the derivation of the confidence
interval we need to estimate the variance $\text{Var}(\widehat{\Theta})$. For simple problems such as the sample
average, analyzing $\text{Var}(\widehat{\Theta})$ is not difficult. However, if $\widehat{\Theta}$ is a more complicated statistic, e.g.,
the median, analyzing $\text{Var}(\widehat{\Theta})$ may not be as straightforward. Bootstrapping is a technique
that is suitable for this purpose.

Why is it difficult to provide a confidence interval for estimators such as the median? 


A couple of difficulties arise: 


• Many estimators do not have a simple expression for the variance. For simple
estimators such as the sample average $\widehat{\Theta} = (1/N)\sum_{n=1}^{N} X_n$, the variance is $\sigma^2/N$. If the
estimator is the median $\widehat{\Theta} = \text{median}\{X_1, \ldots, X_N\}$, the variance of $\widehat{\Theta}$ will depend on
the underlying distribution of the $X_n$’s. If the estimator is something beyond the sample
median, the variance of $\widehat{\Theta}$ can be even more complicated to determine. Therefore,
techniques such as the Central Limit Theorem do not apply here.


• We typically have only one set of data points. We cannot re-collect more i.i.d. samples
to estimate the variance of the estimator. Therefore, our only option is to squeeze the
information from the data we have been given.


Figure 9.8: Hierarchy of estimation. Bootstrapping belongs to the category of confidence interval. It is
used to report the confidence intervals for estimators that are not the sample averages. (The tree splits
estimation into point estimation, including MMSE with ridge and LASSO regression, and confidence
intervals; the latter covers the sample mean with known variance (Gaussian) or unknown variance
(Student’s t), and other estimators (bootstrap).)


When do we use bootstrapping?

• Bootstrapping is a technique to estimate the confidence interval.


• We use bootstrapping when the estimator does not have a simple expression for
the variance.


• Bootstrapping allows us to estimate the variance without re-collecting more data.


• Bootstrapping does not improve your point estimates.


9.2.1 A brute force approach 


Before we discuss the idea of bootstrapping, we need to elaborate on the difficulty of
estimating the variance using repeated measurements. Suppose that we somehow have access to
the population distribution. Let us denote the CDF of this population distribution by $F_X$,
and the PDF by $f_X$. By having access to the population distribution we can synthetically
generate as many samples $X_n$’s as we want. This is certainly hypothetical, but let’s assume
that it is possible for now.

If we have full access to the population distribution, then we are able to draw K
replicate datasets $\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(K)}$ from $F_X$:

$$\begin{aligned}
\mathcal{X}^{(1)} &= \{X_1^{(1)}, \ldots, X_N^{(1)}\} \sim F_X, \\
\mathcal{X}^{(2)} &= \{X_1^{(2)}, \ldots, X_N^{(2)}\} \sim F_X, \\
&\;\;\vdots \\
\mathcal{X}^{(K)} &= \{X_1^{(K)}, \ldots, X_N^{(K)}\} \sim F_X.
\end{aligned} \qquad (9.14)$$

Each dataset $\mathcal{X}^{(k)}$ contains N data points, and by virtue of i.i.d. all the samples have the
same underlying distribution $F_X$.

For each dataset we construct an estimator $\widehat{\Theta} = g(\cdot)$ for some function $g(\cdot)$. The
estimator takes the data points of the dataset $\mathcal{X}$ and returns a value. Since we have K
datasets, correspondingly we will have K estimators:

$$\widehat{\Theta}^{(k)} = g\left(\mathcal{X}^{(k)}\right) = g\left(X_1^{(k)}, \ldots, X_N^{(k)}\right), \qquad k = 1, \ldots, K. \qquad (9.15)$$

Note that these estimators $g(\cdot)$ can be anything. It can be the sample average or it can be
the sample median. There is no restriction.

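Since we are interested in constructing the confidence interval for $\widehat{\Theta}$, we need to analyze
the mean and variance of $\widehat{\Theta}$. The true mean and the estimated mean of $\widehat{\Theta}$ are

$$\mathbb{E}[\widehat{\Theta}] = \text{true mean of } \widehat{\Theta}, \qquad (9.16)$$

$$M(\widehat{\Theta}) = \text{estimated mean based on } \widehat{\Theta}^{(1)}, \ldots, \widehat{\Theta}^{(K)} \overset{\text{def}}{=} \frac{1}{K}\sum_{k=1}^{K} \widehat{\Theta}^{(k)} = \frac{1}{K}\sum_{k=1}^{K} g\left(\mathcal{X}^{(k)}\right), \qquad (9.17)$$

respectively. Similarly, the true variance and the estimated variance of $\widehat{\Theta}$ are

$$\text{Var}[\widehat{\Theta}] = \text{true variance of } \widehat{\Theta}, \qquad (9.18)$$

$$V(\widehat{\Theta}) = \text{estimated variance based on } \widehat{\Theta}^{(1)}, \ldots, \widehat{\Theta}^{(K)} \overset{\text{def}}{=} \frac{1}{K}\sum_{k=1}^{K} \left(\widehat{\Theta}^{(k)} - M(\widehat{\Theta})\right)^2 = \frac{1}{K}\sum_{k=1}^{K} \left(g\left(\mathcal{X}^{(k)}\right) - M(\widehat{\Theta})\right)^2. \qquad (9.19)$$

These two equations should be familiar: Since $\widehat{\Theta}$ is a random variable, and $\{\widehat{\Theta}^{(k)}\}$ are i.i.d.
copies of $\widehat{\Theta}$, we can compute the average of $\widehat{\Theta}^{(1)}, \ldots, \widehat{\Theta}^{(K)}$ and the corresponding variance.
As the number of repeated trials K approaches $\infty$, the estimated variance $V(\widehat{\Theta})$ will converge
to $\text{Var}(\widehat{\Theta})$ according to the law of large numbers.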

We can summarize the procedure we have just outlined. To produce an estimate of the
variance, we run the algorithm below.

Algorithm 1: Brute force method to generate an estimated variance


• Assume: We have access to $F_X$.
• Step 1: Generate datasets $\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(K)}$ from $F_X$.
• Step 2: Compute $M(\widehat{\Theta})$ and $V(\widehat{\Theta})$ based on the samples.
• Output: The estimated variance is $V(\widehat{\Theta})$.
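A minimal Python sketch of Algorithm 1 is given here for illustration. It assumes, purely hypothetically, that the population $F_X$ is Gaussian(0,1) and that the estimator $g(\cdot)$ is the median; the names `brute_force_variance` and `gaussian_population` are not from the text.

```python
import numpy as np

def brute_force_variance(sample_population, g, N, K, rng):
    """Algorithm 1: draw K fresh datasets of size N and return (M, V)."""
    Theta_hat = np.empty(K)
    for k in range(K):
        X_k = sample_population(rng, N)     # Step 1: a new dataset from F_X
        Theta_hat[k] = g(X_k)               # estimator on the k-th dataset
    M = Theta_hat.mean()                    # Step 2: estimated mean, Eq. (9.17)
    V = np.mean((Theta_hat - M) ** 2)       # Step 2: estimated variance, Eq. (9.19)
    return M, V

def gaussian_population(rng, N):
    """Hypothetical population F_X: standard Gaussian."""
    return rng.normal(0.0, 1.0, size=N)

rng = np.random.default_rng(0)
M, V = brute_force_variance(gaussian_population, np.median, N=50, K=5000, rng=rng)
print(M, V)
```

The estimator is passed in as a function, so replacing `np.median` by `np.mean` or any other statistic requires no other change. This is only possible because the sketch pretends to know $F_X$.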


The problem, however, is that we only have one dataset $\mathcal{X}$. We do not have access to
$\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(K)}$, and we do not have access to $F_X$. Therefore, we are not able to
approximate the variance using the above brute force simulation. Bootstrapping is a computational
technique to mimic the above simulation process by using the available data in $\mathcal{X}$.


9.2.2 Bootstrapping 


The idea of bootstrapping is illustrated in Figure 9.9. Imagine that we have a population
CDF $F_X$ and PDF $f_X$. The dataset we have in hand, $\mathcal{X}$, is a collection of the random
realizations of the random variable X. This dataset $\mathcal{X}$ contains N data points $\mathcal{X} = \{X_1, \ldots, X_N\}$.


Figure 9.9: A conceptual illustration of bootstrapping. Given the observed dataset $\mathcal{X}$, we synthetically
construct K bootstrapped datasets (colored in yellow) by sampling with replacement from $\mathcal{X}$. We
then compute the estimators, e.g., computing the median, for every bootstrapped dataset. Finally, we
construct the estimator’s histogram (in blue) to compute the bootstrapped mean and variance.


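In bootstrapping, we synthesize K bootstrapped datasets $\mathcal{Y}^{(1)}, \ldots, \mathcal{Y}^{(K)}$, where each
bootstrapped dataset $\mathcal{Y}^{(k)}$ consists of N samples redrawn from $\mathcal{X}$. Essentially, we draw with
replacement N samples from the observed dataset $\mathcal{X}$:

$$\begin{aligned}
\mathcal{Y}^{(1)} &= \{Y_1^{(1)}, \ldots, Y_N^{(1)}\} = N \text{ random samples from } \mathcal{X}, \\
&\;\;\vdots \\
\mathcal{Y}^{(K)} &= \{Y_1^{(K)}, \ldots, Y_N^{(K)}\} = N \text{ random samples from } \mathcal{X}.
\end{aligned}$$

Afterward, we construct our estimator $\widehat{\Theta}$ according to our desired function $g(\cdot)$. For example,
if $g(\cdot)$ = median, we have

$$\begin{aligned}
\widehat{\Theta}_{\text{boot}}^{(1)} &= g\left(\mathcal{Y}^{(1)}\right) = \text{median}\left(\mathcal{Y}^{(1)}\right), \\
&\;\;\vdots \\
\widehat{\Theta}_{\text{boot}}^{(K)} &= g\left(\mathcal{Y}^{(K)}\right) = \text{median}\left(\mathcal{Y}^{(K)}\right).
\end{aligned}$$

Then, we define the bootstrapped mean and the bootstrapped variance as

$$M_{\text{boot}}(\widehat{\Theta}) = \frac{1}{K}\sum_{k=1}^{K} \widehat{\Theta}_{\text{boot}}^{(k)}, \qquad (9.20)$$

$$V_{\text{boot}}(\widehat{\Theta}) = \frac{1}{K}\sum_{k=1}^{K} \left(\widehat{\Theta}_{\text{boot}}^{(k)} - M_{\text{boot}}(\widehat{\Theta})\right)^2. \qquad (9.21)$$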

The procedure we have just outlined can be summarized as follows. 


Algorithm 2: Bootstrapping to generate an estimated variance


• Assume: We do NOT have access to $F_X$, but we have one dataset $\mathcal{X}$.


• Step 1: Generate datasets $\mathcal{Y}^{(1)}, \ldots, \mathcal{Y}^{(K)}$ from $\mathcal{X}$, by sampling with replacement from
$\mathcal{X}$.


• Step 2: Compute $M_{\text{boot}}(\widehat{\Theta})$ and $V_{\text{boot}}(\widehat{\Theta})$ based on the samples.


• Output: The bootstrapped variance is $V_{\text{boot}}(\widehat{\Theta})$.


The only difference between this algorithm and the previous one is that we are not
synthesizing data from the population but rather from the observed dataset $\mathcal{X}$.

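What makes bootstrapping work? The basic principle of bootstrapping is based on
three approximations:

$$\begin{aligned}
\text{Var}_F(\widehat{\Theta}) &\overset{(a)}{\approx} V_{\text{full}}(\widehat{\Theta}), \\
\text{Var}_F(\widehat{\Theta}) &\overset{(b)}{\approx} \text{Var}_{\widehat{F}}(\widehat{\Theta}), \\
\text{Var}_{\widehat{F}}(\widehat{\Theta}) &\overset{(c)}{\approx} V_{\text{boot}}(\widehat{\Theta}).
\end{aligned}$$

In this set of equations, the ultimate quantity we want to know is $\text{Var}_F(\widehat{\Theta})$, which is the
variance of $\widehat{\Theta}$ under F. (By “under F” we mean that the variance was found by integrating
with respect to the distribution $F_X$.) However, since we do not have access to F, we have
to approximate $\text{Var}_F(\widehat{\Theta})$ by $V_{\text{full}}(\widehat{\Theta})$. $V_{\text{full}}(\widehat{\Theta})$ is the sample variance computed from the
K hypothetical datasets $\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(K)}$. We call it “full” because we can generate as many
hypothetical datasets as we want. It is marked as the approximation (a) above.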

In the bootstrapping world, we approximate the underlying distribution F by some
other distribution $\widehat{F}$. For example, if F is the CDF of a Gaussian distribution, we can
choose $\widehat{F}$ to be the finite-sample staircase function approximating F. In our case, we use the
observed dataset $\mathcal{X}$ to serve as a proxy $\widehat{F}$ to F. This is the second approximation, marked by
(b). Normally, if you have a reasonably large $\mathcal{X}$, it is safe to assume that this finite-sample
dataset $\mathcal{X}$ has a CDF $\widehat{F}$ that is close to the true CDF F.

The third approximation is to find a numerical estimate of $\text{Var}_{\widehat{F}}(\widehat{\Theta})$ via the simulation
procedure we have just outlined. This is essentially the same line of argument for (a) but
now applied to the bootstrapping world. We mark this approximation by (c). Its goal is to
approximate $\text{Var}_{\widehat{F}}(\widehat{\Theta})$ via $V_{\text{boot}}(\widehat{\Theta})$.

The three approximations have their respective influence on the accuracy of the
bootstrapped variance:


How does bootstrapping work?


• It is based on three approximations:


• (a): A hypothetical approximation. The best we can do is that we have access
to F. It is practically impossible to achieve, but it gives us intuition.


• (b): Approximate F by $\widehat{F}$, where $\widehat{F}$ is the empirical CDF of the observed data.
This is usually the source of error. The approximation error reduces when you
use more samples to approximate F.


• (c): Approximate the theoretical bootstrapped variance by a finite approximation.
This approximation error is usually small because you can generate as many
bootstrapped datasets as you want.
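Approximation (b) can be illustrated with a small experiment. The sketch below assumes a hypothetical discrete population with three states and probabilities 0.3, 0.2, and 0.5 (an illustrative choice), and measures how far the empirical distribution $\widehat{F}$ built from an observed dataset is from the population F for different dataset sizes. The helper name `empirical_pmf` is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
states = np.array([0, 1, 2])
pmf = np.array([0.3, 0.2, 0.5])          # hypothetical population distribution F

def empirical_pmf(X):
    """Fraction of samples in each state: the probabilities assigned by F-hat."""
    return np.array([np.mean(X == s) for s in states])

errors = {}
for N in (20, 200, 20000):
    X = rng.choice(states, size=N, p=pmf)          # the observed dataset
    errors[N] = np.max(np.abs(empirical_pmf(X) - pmf))
    print(f"N = {N:5d}: largest deviation of F-hat from F = {errors[N]:.4f}")
```

For a small dataset the deviation can be noticeable, and this is the error that bootstrapping inherits regardless of how many bootstrapped datasets K are generated afterwards.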


One “mysterious” property of bootstrapping is the sampling with replacement scheme 
used to synthesize the bootstrapped samples. The typical questions are: 


• (1) Why does sampling from the observed dataset $\mathcal{X}$ lead to meaningful
bootstrapped datasets $\mathcal{Y}^{(1)}, \ldots, \mathcal{Y}^{(K)}$? To answer this question we consider the following
toy example. Suppose we have a dataset $\mathcal{X}$ containing N = 20 samples, as shown
below.


$\mathcal{X}$ = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}


This dataset is generated from a random variable X with a PDF $f_X$ having three states:
0 (30%), 1 (20%), 2 (50%). As we draw samples from $\mathcal{X}$, the percentage of the states
will determine the likelihood of one state being drawn. For example, if we randomly
pick a sample $Y_n$ from $\mathcal{X}$, we have a 30% chance of having $Y_n$ to be 0, 20% chance
of having it to be 1, and 50% chance of having it to be 2. Therefore, the PDF of $Y_n$
(the randomly drawn sample from $\mathcal{X}$) will be 0 (30%), 1 (20%), 2 (50%), the same
as the original PDF. If you think about this problem more deeply, by “sampling with
replacement” we essentially assign each $X_n$ with an equal probability of 1/N. If one
of the states is more popular, the individual probabilities will add to form a higher
probability mass.


• (2) Why can’t we do sampling without replacement, aka permutation? We need to
understand that sampling without replacement is the same as permuting the data in $\mathcal{X}$.
By permuting the data in $\mathcal{X}$, the simple probability assignments such as $\mathbb{P}[X = 0] = \frac{6}{20}$,
$\mathbb{P}[X = 1] = \frac{4}{20}$ and $\mathbb{P}[X = 2] = \frac{10}{20}$ will be destroyed. Moreover, permuting the data
does not change the mean and variance of the data because we are only shuffling the
order. As far as constructing the confidence interval is concerned, shuffling the order
is not useful.

On computers it is easy to generate the bootstrapped datasets, along with their mean
and variance. In MATLAB the key step is to call a for loop. Inside the for loop, we draw
N random indices using randi from 1 to N and pick the samples. The estimator Thetahat is then
constructed by calling your target estimator function $g(\cdot)$. In this example the estimator is
the median. After the for loop, we compute the mean and variance of $\widehat{\Theta}$. These are the
bootstrapped mean and variance, respectively.


% MATLAB code to estimate a bootstrapped variance
X = [72, 69, 75, 58, 67, 70, 60, 71, 59, 65];
N = size(X,2);
K = 1000;
Thetahat = zeros(1,K);
for i=1:K                       % repeat K times
    idx = randi(N,[1, N]);      % sampling w/ replacement
    Y = X(idx);
    Thetahat(i) = median(Y);    % estimator
end
M = mean(Thetahat)              % bootstrapped mean
V = var(Thetahat)               % bootstrapped variance


The Python commands are similar. We call np.random.randint to generate random 
integers and we pick samples according to Y = X[idx]. After generating the bootstrapped 
dataset, we compute the bootstrap estimators Thetahat. 


# Python code to estimate a bootstrapped variance
import numpy as np
X = np.array([72, 69, 75, 58, 67, 70, 60, 71, 59, 65])
N = X.size
K = 1000
Thetahat = np.zeros(K)
for i in range(K):                      # repeat K times
    idx = np.random.randint(N, size=N)  # sampling w/ replacement
    Y = X[idx]
    Thetahat[i] = np.median(Y)          # estimator
M = np.mean(Thetahat)                   # bootstrapped mean
V = np.var(Thetahat)                    # bootstrapped variance


After we have constructed the bootstrapped variance, we can define the bootstrapped
standard error as

$$\widehat{\text{se}}_{\text{boot}} = \sqrt{V_{\text{boot}}(\widehat{\Theta})}. \qquad (9.22)$$

Accordingly we define the bootstrapped confidence interval as

$$\mathcal{I} = \left[\widehat{\Theta} - z_\alpha \widehat{\text{se}}_{\text{boot}}, \;\; \widehat{\Theta} + z_\alpha \widehat{\text{se}}_{\text{boot}}\right], \qquad (9.23)$$

where $z_\alpha$ is the critical value of the Gaussian.
The validity of the confidence intervals constructed by bootstrapping is subject to
the validity of $z_\alpha$. If $\widehat{\Theta}$ is roughly a Gaussian, the bootstrapped confidence interval will be
reasonably good. If $\widehat{\Theta}$ is not Gaussian, there are advanced methods to replace $z_\alpha$ with better
estimates. This topic is beyond the scope of this book; we refer interested readers to Larry
Wasserman, All of Statistics, Springer 2003, Chapter 8.
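Putting Equations (9.22) and (9.23) together, a bootstrapped confidence interval for the median of the ten exam scores can be sketched as follows. This is an illustration rather than a definitive implementation: the function name `bootstrap_ci`, the fixed random seed, and the use of the two-sided Gaussian critical value $\Phi^{-1}(1 - \alpha/2)$ for $z_\alpha$ are assumptions of the sketch.

```python
import numpy as np
from statistics import NormalDist

def bootstrap_ci(X, g, K=1000, alpha=0.05, seed=0):
    """Gaussian-based bootstrapped confidence interval, Eqs. (9.22)-(9.23)."""
    rng = np.random.default_rng(seed)
    N = X.size
    Theta_boot = np.empty(K)
    for k in range(K):
        Y = X[rng.integers(0, N, size=N)]    # sample N points with replacement
        Theta_boot[k] = g(Y)                 # estimator on the bootstrapped dataset
    se_boot = np.sqrt(np.var(Theta_boot))    # bootstrapped standard error, Eq. (9.22)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided Gaussian critical value
    theta_hat = g(X)                         # point estimate from the observed data
    return theta_hat - z * se_boot, theta_hat + z * se_boot, se_boot

X = np.array([72, 69, 75, 58, 67, 70, 60, 71, 59, 65])
lo, hi, se = bootstrap_ci(X, np.median)
print(lo, hi, se)
```

The interval is centered at the point estimate computed from the observed data; bootstrapping only supplies the width.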


9.3 Hypothesis Testing 


Imagine that you are a vaccine company developing COVID-19 vaccines. You gave the
vaccine to 934 patients, and 928 patients have developed antibodies. How confident can you
be that your vaccine is effective? Questions like this are becoming more common nowadays 
in situations in which we need to make statistically informed choices between YES and NO. 
The subject of this section is hypothesis testing — a principled statistical procedure used 
to evaluate statements that should be accepted or rejected. 


9.3.1 What is a hypothesis? 


A hypothesis is a statement that requires testing by observation to determine whether it is 
true or false. A few examples: 


• The coin is unbiased.

• Students entering the graduate program have GPA > 3.
• More people like orange juice than lemonade.

• Algorithm A performs better than Algorithm B.


As you can see from these examples, a hypothesis is something we can test based on the 
data. Therefore, being “correct” or “wrong” depends on the statistics we have and the cutoff 
threshold. Accepting or rejecting a hypothesis does not mean that the statement is correct 
or wrong, since the truth is unknown. If we accept a hypothesis, we have made a better 
decision solely based on the statistical evidence. It is possible that tomorrow, when we have
collected more data, we may reject a previously accepted hypothesis.

The procedure for testing whether a hypothesis should be accepted or rejected is known 
as hypothesis testing. In hypothesis testing, we often have two opposite hypotheses: 


• $H_0$: Null hypothesis. It is the “status quo”, or the current status.

• $H_1$: Alternative hypothesis. It is the alternative to the null hypothesis.

To better understand hypothesis testing, consider a courthouse. By default, any person
being prosecuted is assumed to be innocent. The police need to show sufficient evidence in
order to prove the person guilty. The null hypothesis is the default assumption. Hypothesis
testing asks whether we have strong enough evidence to reject the null hypothesis. If our
evidence is not strong enough, we must assume that the null hypothesis is possibly true.


Example 9.6. Suggest a null hypothesis and an alternative hypothesis regarding 
whether a coin is unbiased. 


Solution: Let $\theta$ be the probability of getting a head.
• $H_0$: $\theta = 0.5$, and $H_1$: $\theta > 0.5$. This is a one-sided alternative.
• $H_0$: $\theta = 0.5$, and $H_1$: $\theta < 0.5$. This is another one-sided alternative.
• $H_0$: $\theta = 0.5$, and $H_1$: $\theta \neq 0.5$. This is a two-sided alternative.


Practice Exercise 9.3. Suggest a null and an alternative hypothesis regarding whether 
more than 62% of people in the United States use Microsoft Windows. 


Solution: Let $\theta$ be the proportion of people using Microsoft Windows in the United States.
• $H_0$: $\theta \ge 0.62$, and $H_1$: $\theta < 0.62$. This is a one-sided alternative.


Practice Exercise 9.4. Suggest a null and an alternative hypothesis regarding whether 
self-checkout at Walmart is faster than using a cashier. 


Solution: Let $\theta$ be the proportion of people that check out faster with self-checkout.


• $H_0$: $\theta \ge 0.5$, and $H_1$: $\theta < 0.5$. This is a one-sided alternative.


9.3.2 Critical-value test 


In hypothesis testing, there are two major approaches: the critical-value test, and the 
p-value test. The two tests are more or less equivalent. If you reject the null hypothesis using 
the critical-value test, you will reject the hypothesis using the p-value. In this subsection, 
we will discuss the critical-value test. Let us consider a toy problem: 

Suppose that we have a 4-sided die and our goal is to test whether the die is unbiased. 
To do so, we define the null and the alternative hypotheses as 


• $H_0$: $\theta = 0.25$, which is our default belief.
• $H_1$: $\theta > 0.25$, which is a one-sided alternative.


There is no particular reason for considering the one-sided alternative other than the fact 
that the calculation is slightly easier. You are welcome to consider the two-sided alternative. 
We must obtain data prior to conducting any hypothesis testing. Let’s assume that we
have thrown the die N = 1000 times. We find that “3” appears 290 times (we could just as
well have chosen 1, 2, or 4). We let $X_1, \ldots, X_{1000}$ be the N = 1000 binary random variables
representing whether we have obtained a “3” or not, so that $X_n = 1$ if the n-th throw shows
a “3” and $X_n = 0$ otherwise. If the true probability is $\theta = 0.25$, then
we will have $\mathbb{P}[X_n = 1] = \theta = 0.25$ and $\mathbb{P}[X_n = 0] = 1 - \theta = 0.75$. We know that we cannot
access the true probability, so we can only construct an estimator of the probability:

$$\widehat{\Theta} = \frac{1}{N}\sum_{n=1}^{N} X_n.$$

In this experiment, we can show that $\widehat{\Theta} = 290/1000 = 0.29$.
To make our problem slightly easier, we pretend that we know the variance $\text{Var}[X_n]$.
In practice, we certainly do not know $\text{Var}[X_n]$, and so we need to estimate the variance. If
we knew the variance, it should be $\text{Var}[X_n] = \theta(1 - \theta) = 0.25(1 - 0.25) = 0.1875$, because
$X_n$ is a Bernoulli random variable with a mean $\theta$.

The question asked by hypothesis testing is: How far is “$\widehat{\Theta} = 0.29$” from “$\theta = 0.25$”?
If the statistic generated by our data, $\widehat{\Theta} = 0.29$, is “far” from the hypothesized $\theta = 0.25$,
then we need to reject $H_0$ because $H_0$ says that $\theta = 0.25$. However, if there is no strong
evidence that $\theta > 0.25$, we will need to assume that $H_0$ may possibly be true. So the key
question is what is meant by “far”.

For many problems like this one, it is possible to analyze the PDF of $\widehat{\Theta}$. Since $\widehat{\Theta}$ is the
sample average of a sequence of Bernoulli random variables, it follows that $\widehat{\Theta}$ is a binomial
(with a scaling constant 1/N). If N is large enough, e.g., N > 30, the Central Limit Theorem
tells us that $\widehat{\Theta}$ is also very close to a Gaussian. Therefore, we can more or less claim that

$$\widehat{\Theta} \sim \text{Gaussian}\left(\theta, \frac{\sigma^2}{N}\right).$$

With a simple translation and scaling, we can normalize $\widehat{\Theta}$ to obtain $\widehat{Z}$:

$$\widehat{Z} = \frac{\widehat{\Theta} - \theta}{\sigma/\sqrt{N}} \sim \text{Gaussian}(0, 1).$$

Figure 9.10 illustrates the range of values for this problem. There are two axes: the
$\widehat{\Theta}$-axis (which is the estimator) and the $\widehat{Z}$-axis (which is the normalized variable). The values
corresponding to each axis are shown in the figure. For example, $\widehat{\Theta} = 0.29$ is equivalent
to $\widehat{Z} = 2.92$, and $\widehat{\Theta} = 0.25$ is equivalent to $\widehat{Z} = 0$, etc. Therefore, when we ask how far
“$\widehat{\Theta} = 0.29$” is from “$\theta = 0.25$”, we can map this question from the $\widehat{\Theta}$-axis to the $\widehat{Z}$-axis,
and ask the relative position of $\widehat{Z}$ from the origin.


Figure 9.10: The mapping between $\widehat{\Theta}$ and $\widehat{Z}$. To decide whether we want to reject or keep $H_0$, the
critical-value approach compares $\widehat{Z}$ relative to the critical value $z_\alpha$.

On a computer, obtaining these values is quite straightforward. Using MATLAB,
finding $\widehat{Z}$ can be done by calling the following commands. The Python code is analogous.


% MATLAB command to estimate the Z_hat value.
Theta_hat = 0.29;                 % Your estimate
theta = 0.25;                     % Your hypothesis
sigma = sqrt(theta*(1-theta));    % Known standard deviation
N = 1000;                         % Number of samples
Z_hat = (Theta_hat - theta)/(sigma/sqrt(N));


# Python command to estimate the Z_hat value
import numpy as np
Theta_hat = 0.29                   # Your estimate
theta = 0.25                       # Your hypothesis
N = 1000                           # Number of samples
sigma = np.sqrt(theta*(1-theta))   # Known standard deviation
Z_hat = (Theta_hat - theta)/(sigma / np.sqrt(N))
print(Z_hat)


One essential element of hypothesis testing is the cutoff threshold, which is defined
through the critical level $\alpha$. It is the area under the curve of the PDF of $\widehat{Z}$ beyond the
cutoff. Typically, $\alpha$ is chosen to be a small value, such as $\alpha = 0.05$ (corresponding to a
5% margin). The corresponding cutoff is known as the critical value. It is defined as

$$z_\alpha = \text{cutoff location where the area under the curve is } \alpha.$$

If $\widehat{Z}$ is Gaussian(0,1) and if we are looking at the right-hand tail, it follows that

$$z_\alpha = \Phi^{-1}(1 - \alpha). \qquad (9.24)$$

In our example, we find that $z_{0.05} = 1.65$, which is marked in Figure 9.10.
On computers, determining the critical value $z_\alpha$ is straightforward. In MATLAB the
command is icdf, and in Python the command is stats.norm.ppf.


% MATLAB code to compute the critical value
alpha = 0.05;
z_alpha = icdf('norm', 1-alpha, 0, 1);


# Python code to compute the critical value 
import scipy.stats as stats 

alpha = 0.05 

z_alpha = stats.norm.ppf(1-alpha, 0, 1) 


Do we have enough evidence to reject $H_0$ in this example? Of course! The estimated
value $\widehat{\Theta} = 0.29$ is equivalent to $\widehat{Z} = 2.92$, which lies well beyond the cutoff $z_\alpha = 1.65$.
In other words, we conclude that at a 5% critical level we have strong evidence to believe
that the die is biased. Therefore, we need to reject $H_0$.

This conclusion makes a lot of sense if you think about it carefully. The estimator
$\widehat{\Theta} = 0.29$ is obtained from N = 1000 independent experiments. If we were only conducting
N = 20 experiments, it might be consistent with the null hypothesis to have $\widehat{\Theta} = 0.29$.
However, if we have N = 1000 experiments, having $\widehat{\Theta} = 0.29$ does not seem likely when
there is no systematic bias. If there is no systematic bias, the estimator $\widehat{\Theta}$ should slightly
jitter around $\theta = 0.25$, but it is quite unlikely to vary wildly to $\widehat{\Theta} = 0.29$. Thus, based on
the available statistics, we decide to reject the null hypothesis.

The decision based on comparing the critical value is known as the critical-value test.
The idea (for testing a right-hand tail of a Gaussian random variable) is summarized in
three steps:


How to conduct a critical-value test
• Set a critical value $z_\alpha$. Compute $\widehat{Z} = (\widehat{\Theta} - \theta)/(\sigma/\sqrt{N})$.


• If $\widehat{Z} \ge z_\alpha$, then reject $H_0$.
• If $\widehat{Z} < z_\alpha$, then keep $H_0$.


If you are testing a left-hand tail, you can switch the order of the inequalities.


The critical-value test belongs to a larger family of testing procedures based on
decision theory. To give you a preview of the general theory of hypothesis testing, we define a
decision rule, a function that maps a realization of the estimator to a binary decision space.
In our problem the estimator is $\widehat{Z}$ (or equivalently $\widehat{\Theta}$). We denote its realization by $\widehat{z}$. The
binary decision space is $\{H_0, H_1\}$, corresponding to whether we want to claim $H_0$ or $H_1$.
Claiming $H_0$ is equivalent to keeping $H_0$, and claiming $H_1$ is equivalent to rejecting $H_0$.
For the critical-value test, the decision rule $\delta(\cdot) : \mathbb{R} \to \{0, 1\}$ is given by the equation (for
testing a right-hand tail):

$$\delta(\widehat{z}) = \begin{cases} 1, & \text{if } \widehat{z} \ge z_\alpha, \quad (\text{claim } H_1), \\ 0, & \text{if } \widehat{z} < z_\alpha, \quad (\text{claim } H_0). \end{cases} \qquad (9.25)$$


Example 9.7. It was found that only 35% of the children in a kindergarten eat 
broccoli. The teachers conducted a campaign to get more kids to eat broccoli, after 
which it was found that 390 kids out of 1009 kids reported that they had eaten broccoli. 
Has the campaign successfully increased the number of kids eating broccoli? Assume 
that the standard deviation is known. 


Solution. We set up the null and the alternative hypotheses:
$$H_0: \theta = 0.35, \qquad H_1: \theta > 0.35.$$

We construct an estimator $\hat{\Theta} = (1/N) \sum_{n=1}^{N} X_n$, where $X_n$ is Bernoulli with probability
$\theta$. Based on $\theta$, $\sigma^2 = \theta(1 - \theta) = 0.227$. (Again, in practice we do not know the
true variance, but in this problem we pretend that we know it.)

By the Central Limit Theorem, $\hat{\Theta}$ is roughly a Gaussian. We compute the test
statistic $\hat{\Theta} = \frac{390}{1009} = 0.387$. Standardization gives $\hat{Z} = \frac{\hat{\Theta} - \theta}{\sigma/\sqrt{N}} = 2.432$. At a 5%
critical level, we have that $z_\alpha = 1.65$. So $\hat{Z} = 2.432 > 1.65 = z_\alpha$, and hence we need
to reject the null hypothesis. Even if we choose a 1% critical level so that $z_\alpha = 2.32$,
our estimator $\hat{Z} = 2.432 > 2.32 = z_\alpha$ will still reject the null hypothesis.
A graphical illustration of this problem is shown in Figure 9.11. It can be seen
that $\hat{\Theta} = 0.387$ (equivalently $\hat{Z} = 2.432$) is actually quite far beyond the cutoff 1.65. Thus,
we need to reject the null hypothesis.

Figure 9.11: Example of a critical-value test. In this example, the test statistic $\hat{\Theta} = 0.387$ is
equivalent to $\hat{Z} = 2.432$, which is significantly larger than the cutoff $z_\alpha = 1.65$. Therefore, we
have strong evidence to reject the null hypothesis, because the probability of obtaining $\hat{\Theta} = 0.387$
is very low if $H_0$ is true.
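
The arithmetic of Example 9.7 is easy to reproduce numerically. The following is a minimal sketch that uses only the numbers stated in the example; the variable names are illustrative.

```python
import math
from scipy import stats

theta0 = 0.35                 # null hypothesis: 35% of the children eat broccoli
N = 1009                      # children surveyed after the campaign
theta_hat = 390 / N           # observed fraction

# As in the example, pretend the variance under H0 is known: theta0 * (1 - theta0).
sigma = math.sqrt(theta0 * (1 - theta0))
z_hat = (theta_hat - theta0) / (sigma / math.sqrt(N))

# Compare against the right-hand-tail cutoff at two critical levels.
decisions = {}
for alpha in (0.05, 0.01):
    z_alpha = stats.norm.ppf(1 - alpha)
    decisions[alpha] = bool(z_hat >= z_alpha)
    print(f"alpha = {alpha}: z_hat = {z_hat:.3f}, z_alpha = {z_alpha:.3f}, "
          f"reject H0: {decisions[alpha]}")
```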


9.3.3 p-value test 


An alternative to the critical-value test is the p-value test. Instead of looking at the cutoff
value $z_\alpha$, we inspect the probability of obtaining our observation if $H_0$ is true. To understand
how the p-value test works, we consider another toy problem.

Suppose that we have two hypotheses about flipping a coin:

• $H_0$: $\theta = 0.9$, which is our default belief.
• $H_1$: $\theta < 0.9$, which is a one-sided alternative.

It was found that with $N = 150$ coin flips, the coin landed on heads 128 times. Thus the
estimator is $\hat{\Theta} = \frac{128}{150} = 0.853$. Then, by following our previous procedures, we have that
$$\hat{Z} = \frac{\hat{\Theta} - \theta}{\sigma/\sqrt{N}} = \frac{0.853 - 0.9}{\sqrt{\frac{0.9(1-0.9)}{150}}} = -1.92.$$


At this point we can follow the previous subsection by computing the critical value $z_\alpha$
and making the decision. However, let's take a different route. We want to know the
probability under the curve if we integrate the PDF of $\hat{Z}$ from $-\infty$ to $-1.92$. This is easy.
Since $\hat{Z}$ is Gaussian(0, 1), it follows from the CDF of a Gaussian that
$$\underbrace{P[\hat{Z} \le -1.92]}_{\text{p-value}} = 0.0274.$$
Referring to Figure 9.12, the value 0.0274 is the pink area under the curve, which is the
PDF of $\hat{Z}$. Since the area under the curve is less than the critical level $\alpha$ (say 5%), we reject
the null hypothesis.
On computers, computing the p-value is done using the CDF commands. 


% MATLAB code to compute the p-value
p = cdf('norm', -1.92, 0, 1);

Figure 9.12: The p-value test asks us to look at the probability of $\hat{Z} \le \hat{z}$. If this probability (the p-value)
is less than the critical level $\alpha$, we have significant evidence to reject the null hypothesis.


# Python code to compute the p-value 


import scipy.stats as stats 
p = stats.norm.cdf(-1.92,0,1) 


In this example, the probability $P[\hat{Z} \le -1.92]$ is known as the p-value. It is the
probability of $\hat{Z} \le \hat{z}$, under the distribution mandated by the null hypothesis, where $\hat{z}$
is the (normalized) estimated value based on data. Using our example, $\hat{z}$ is $-1.92$. By
"distribution mandated by the null hypothesis" we mean that the PDF of $\hat{Z}$ is the PDF that
the null hypothesis wants. In the above example the PDF is Gaussian(0, 1), corresponding
to a Gaussian with mean $\theta$ and standard deviation $\sigma/\sqrt{N}$ for $\hat{\Theta}$.

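More formally, the p-value for a left-hand tail test is defined as
$$\text{p-value}(\hat{z}) = P[\hat{Z} \le \hat{z}],$$
where $\hat{z}$ is the random realization of $\hat{Z}$ estimated from the data. The decision rule based
on the p-value is (for the left-hand tail):
$$\delta(\hat{z}) = \begin{cases} 1, & \text{if } P[\hat{Z} \le \hat{z}] < \alpha, \quad (\text{claim } H_1), \\ 0, & \text{if } P[\hat{Z} \le \hat{z}] \ge \alpha, \quad (\text{claim } H_0). \end{cases} \tag{9.26}$$

If the alternative hypothesis is right-handed, then the probability becomes $P[\hat{Z} \ge \hat{z}]$ instead.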
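
As a small illustration of the decision rule based on the p-value, the sketch below handles both the left-hand and the right-hand tail for a Gaussian(0,1) statistic. The helper name p_value and its tail argument are hypothetical, introduced only for this example.

```python
from scipy import stats

def p_value(z_hat, tail="left"):
    """p-value of a standardized statistic under Gaussian(0,1) (hypothetical helper)."""
    if tail == "left":
        return stats.norm.cdf(z_hat)   # P[Z <= z_hat]
    if tail == "right":
        return stats.norm.sf(z_hat)    # P[Z >= z_hat] = 1 - CDF
    raise ValueError("tail must be 'left' or 'right'")

alpha = 0.05
p = p_value(-1.92, tail="left")        # coin example: z_hat = -1.92
decision = 1 if p < alpha else 0       # 1 = claim H1, 0 = claim H0
print(f"p-value = {p:.4f}, decision = {decision}")
```

Using the survival function `sf` for the right-hand tail avoids computing `1 - cdf` by hand, which matters numerically only when the tail probability is extremely small.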

Relationship between critical-value and p-value tests. There is a one-to-one correspondence
between the p-value and the critical value. In the p-value test, if $\hat{Z}$ is Gaussian,
it follows that
$$\text{p-value} = P[\hat{Z} \le \hat{z}] = \Phi(\hat{z}),$$
where $\Phi$ is the CDF of the standard Gaussian. Taking the inverse, the corresponding $\hat{z}$ is
$$\hat{z} = \Phi^{-1}(\text{p-value}).$$
In practice, we do not need to take any inverse of the p-value to obtain $\hat{z}$ because it is
directly available from the data.
To test the p-value, we compare it with the critical level $\alpha$ by checking
$$\text{p-value} < \alpha.$$
Since $\Phi^{-1}$ is monotonically increasing, taking the inverse of both sides preserves the
inequality, and so the decision rule is equivalent to
$$\underbrace{\Phi^{-1}(\text{p-value})}_{\hat{z}} < \underbrace{\Phi^{-1}(\alpha)}_{z_\alpha},$$
where the quantity on the right-hand side is the critical value $z_\alpha$ (for the left-hand tail).
Therefore, if the test statistic fails in the p-value test it will also fail in the critical-value
test, and vice versa.


What is the difference between the critical-value test and p-value test?

• Critical-value test: Compare w.r.t. the critical value, which is the cutoff on the Z-axis.
• p-value test: Compare w.r.t. $\alpha$, which is the probability.
• Both will give you the same statistical conclusion. So it does not matter which
one you use.


Example 9.8. We flip a coin $N = 150$ times and find that 128 are heads. Consider
two hypotheses:

• $H_0$: $\theta = 0.9$, which is our default belief.
• $H_1$: $\theta \neq 0.9$, which is a two-sided alternative.

For a critical level of $\alpha = 0.05$, shall we keep or reject $H_0$?


Solution. We know that $\hat{\Theta} = 128/150 = 0.853$. The normalized statistic is
$$\hat{Z} = \frac{\hat{\Theta} - \theta}{\sigma/\sqrt{N}} = \frac{0.853 - 0.9}{\sqrt{\frac{0.9(1-0.9)}{150}}} = -1.92.$$

To compute the p-value, we observe that the two-sided test means that we consider
the two tails. Thus, we have
$$\text{p-value} = P[|\hat{Z}| > 1.92] = 2 \times P[\hat{Z} > 1.92] = 2 \times 0.0274 = 0.055.$$
For a critical level of $\alpha = 0.05$, the p-value is larger. This means that the observation
$|\hat{Z}| = 1.92$ is not extreme enough. Therefore, we do not have sufficient
evidence to reject the null hypothesis.

If we take the critical-value test, we will reach the same conclusion. The critical
value for $\alpha = 0.05$ is determined by taking the inverse CDF at $1 - 0.025$, giving
$$z_\alpha = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) = 1.96.$$
Since $|\hat{Z}| = 1.92$ has not passed this threshold, we conclude that there is not enough
evidence to reject the null hypothesis.


Figure 9.13: Example of a two-sided test using the p-value and the $z_\alpha$-value.
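
The two-sided test of Example 9.8 can be checked numerically from both viewpoints. This sketch takes the rounded statistic $\hat{Z} = -1.92$ from the example as given; the variable names are illustrative.

```python
from scipy import stats

z_hat = -1.92      # normalized statistic from Example 9.8 (rounded as in the text)
alpha = 0.05

# p-value route: both tails count, so double the one-sided tail probability.
p_two_sided = 2 * stats.norm.sf(abs(z_hat))
reject_by_p = bool(p_two_sided < alpha)

# Critical-value route: split alpha equally between the two tails.
z_crit = stats.norm.ppf(1 - alpha / 2)
reject_by_cv = bool(abs(z_hat) >= z_crit)

print(f"p-value = {p_two_sided:.3f}, reject H0: {reject_by_p}")
print(f"critical value = {z_crit:.2f}, reject H0: {reject_by_cv}")
```

Both routes lead to the same decision, as the relationship between the two tests predicts.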


9.3.4 Z-test and T-test 


The critical-value test and the p-value test are generic tools for hypothesis testing. In this
subsection we introduce the Z-test and the T-test. It is important to understand that the 
Z-test and the T-test refer to the distributional assumptions we make about the variance. 
They define the distribution we use to conduct the test but not the tools. In fact, both the 
Z-test and the T-test can be implemented using the critical-value test or the p-value test. 
Figure 9.14 illustrates the hierarchy of the tests. 


[Figure 9.14 diagram: hypothesis testing for the sample mean, organized as assumption, distribution, and tool.
Known variance: Gaussian ("Z-test"), using either the p-value or the critical value.
Unknown variance: Student's t ("T-test"), using either the p-value or the critical value.]

Figure 9.14: When conducting a hypothesis test of the sample average, we may or may not know
the variance. If we know the variance, we use the Gaussian distribution to conduct either a p-value test
or a critical-value test. If we do not know the variance, we use Student's t-distribution.


The difference between the Gaussian distribution and the T distribution is mainly
attributable to the knowledge about the population variance. If the variance is known,
the distribution of the estimator (which in our case is the sample average) is Gaussian. If
the variance is estimated from the sample, the distribution of the (standardized) estimator
will follow a Student's t-distribution.

To introduce the Z-test and the T-test we consider the following two examples. The 
first example is a Z-test. 


Example 9.9 (Z-test). Suppose we have a Gaussian random variable with unknown
mean $\theta$ and a known standard deviation $\sigma = 11.6$. We draw $N = 25$ samples and construct an
estimator $\hat{\Theta} = 80.94$. We propose two hypotheses:

• $H_0$: $\theta = 85$, which is our default belief.
• $H_1$: $\theta < 85$, which is a one-sided alternative.

For a critical level of $\alpha = 0.05$, shall we keep or reject the null hypothesis?

Solution. The test statistic is
$$\hat{Z} = \frac{\hat{\Theta} - \theta}{\sigma/\sqrt{N}} = \frac{80.94 - 85}{11.6/\sqrt{25}} = -1.75.$$

Since the individual samples are assumed to follow a Gaussian, the sample average $\hat{\Theta}$
is also a Gaussian. Hence, $\hat{Z}$ is distributed according to Gaussian(0, 1).


Figure 9.15: A one-sided Z-test using the p-value and the $z_\alpha$-value.

For a critical level of 0.05, a one-sided (left-hand tail) critical value is
$$z_\alpha = \Phi^{-1}(\alpha) = -1.645.$$

Since $\hat{Z} = -1.75$, which is more extreme than the critical value, we conclude that we
need to reject $H_0$.
If we use the p-value test, we have that the p-value is
$$P[\hat{Z} \le -1.75] = \Phi(-1.75) = 0.0401.$$

Since the p-value is smaller than the critical level $\alpha = 0.05$, it implies that $\hat{Z} = -1.75$
is more extreme. Hence, we reject $H_0$.

The following example is a T-test. In a T-test we do not know the population variance
but only know the sample standard deviation $S$. Thus the test statistic we use is a T random variable.


Example 9.10 (T-test). Suppose we have a Gaussian random variable with unknown
mean $\theta$ and an unknown variance $\sigma^2$. We draw $N = 100$ samples and construct an
estimator $\hat{\Theta} = 130.1$, with a sample standard deviation $S = 21.21$. We propose two hypotheses:

• $H_0$: $\theta = 120$, which is our default belief.
• $H_1$: $\theta \neq 120$, which is a two-sided alternative.

For a critical level of $\alpha = 0.05$, shall we keep or reject the null hypothesis?

Solution. The test statistic is
$$\hat{T} = \frac{\hat{\Theta} - \theta}{S/\sqrt{N}} = \frac{130.1 - 120}{21.21/\sqrt{100}} = 4.762.$$

Note that while the sample average $\hat{\Theta}$ is a Gaussian, the test statistic $T$ is distributed
according to a T distribution with $N - 1$ degrees of freedom. For a critical level of
0.05, a two-sided critical value is
$$t_\alpha = \Psi_{99}^{-1}\left(1 - \frac{\alpha}{2}\right) = 1.984,$$
where $\Psi_{99}$ denotes the CDF of the Student's t-distribution with 99 degrees of freedom.


Since $\hat{T} = 4.762$, which is more extreme than the critical value, we conclude that we
need to reject $H_0$.
If we use the p-value test, we have that the p-value is
$$P[|T| > 4.762] = 2 \times P[T > 4.762] = 2 \times (3.28 \times 10^{-6}) = 6.56 \times 10^{-6}.$$

Since the p-value is (much) smaller than the critical level $\alpha = 0.05$, it implies that
$|T| > 4.762$ is quite extreme. Hence, we reject $H_0$.

Figure 9.16: A two-sided T-test using the p-value and the critical value $t_\alpha$.

For this example, the MATLAB and Python commands to compute $t_\alpha$ and the p-value
are

% MATLAB code to compute critical-value and p-value
t_alpha = icdf('t', 1-0.025, 99);
p = 2*(1-cdf('t', 4.762, 99));   % two-sided: both tails

# Python code to compute critical value and p-value
import scipy.stats as stats
t_alpha = stats.t.ppf(1-0.025, 99)
p = 2*(1-stats.t.cdf(4.762, 99))   # two-sided: both tails


What are the Z-test and the T-test?

• Both are hypothesis tests for the sample average.
• Z-test: Assume known variance. Hence, use the Gaussian distribution.
• T-test: Assume unknown variance. Hence, use the Student's t-distribution.
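
To make the distinction concrete, here is a hedged sketch that runs a Z-test from the summary statistics of Example 9.9 and a T-test from raw samples. The raw samples are synthetic and purely illustrative (they are not the data behind Example 9.10); the T-test is cross-checked against SciPy's `ttest_1samp`, which is two-sided by default.

```python
import math
import numpy as np
from scipy import stats

# Z-test (Example 9.9): known standard deviation, left-hand tail.
theta0, theta_hat, sigma, N = 85.0, 80.94, 11.6, 25
z_hat = (theta_hat - theta0) / (sigma / math.sqrt(N))
p_z = stats.norm.cdf(z_hat)                  # left-tail p-value under Gaussian(0,1)

# T-test on raw samples. Assumption: synthetic Gaussian data, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(loc=130.0, scale=21.0, size=100)
mu0 = 120.0
s = x.std(ddof=1)                            # sample standard deviation (N - 1 divisor)
t_hat = (x.mean() - mu0) / (s / math.sqrt(x.size))
p_t = 2 * stats.t.sf(abs(t_hat), df=x.size - 1)   # two-sided p-value, N - 1 dof

# Cross-check with SciPy's built-in one-sample t-test.
res = stats.ttest_1samp(x, popmean=mu0)

print(f"Z-test: z = {z_hat:.2f}, p-value = {p_z:.4f}")
print(f"T-test: t = {t_hat:.3f} (scipy: {res.statistic:.3f}), p-value = {p_t:.3g}")
```

The only differences between the two tests in code are the divisor ($\sigma$ versus $S$) and the reference distribution (`stats.norm` versus `stats.t` with $N-1$ degrees of freedom).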


Remark. We are exclusively analyzing the sample average in this section. There are other 
types of estimators we can analyze. For example, we can discuss the difference between the 
two means, the ratio of two random variables, etc. If you need tools for these more advanced 
problems, please refer to the reference section at the end of this chapter. 


9.4 Neyman-Pearson Test 


The hypothesis testing procedures we discussed in the previous section are elementary in 
the sense that we have not discussed much theory. This section aims to fill the gap so that 
you can understand hypothesis testing from a broader perspective. This generalization will 
also help to bridge statistics to other disciplines such as classification in machine learning 
and detection in signal processing. We call this theoretical analysis the Neyman-Pearson 
framework. 


9.4.1 Null and alternative distributions 


When we discussed hypothesis testing in the previous section, we focused exclusively on the
null hypothesis $H_0$. Regardless of whether we are studying the Z-test or the T-test, using
the critical value or the p-value, all the distributions are associated with the distribution
under $H_0$.

What do we mean by "distribution under $H_0$"? Using $\hat{\Theta}$ as an example, the PDF of
$\hat{\Theta}$ is assumed to be Gaussian($\theta$, $\sigma^2/N$). This Gaussian, centered at $\theta$, is the distribution
assumed under $H_0$. As we decide whether to keep or reject $H_0$, we look at the critical value
and the p-value of the test statistic under Gaussian($\theta$, $\sigma^2/N$).

Importantly, the analysis of hypothesis testing is not just about $H_0$; it is also about
the alternative hypothesis $H_1$, which uses a different PDF. For example, $H_1$ could use
Gaussian($\theta'$, $\sigma^2/N$) for $\theta' > \theta$. Therefore, for the same test statistic $\hat{\Theta}$, we can check how
close it is to $H_1$.

To capture both distributions, we define
$$f_0(y) = f_Y(y \mid H_0), \qquad f_1(y) = f_Y(y \mid H_1).$$
The first PDF defines the distribution when the true model is $H_0$. The second PDF is the
distribution when the true model is $H_1$.


Example 9.11. Consider an estimator $Y \sim$ Gaussian($\theta$, $\sigma^2/N$). Define two hypotheses
$H_0: \theta = 120$ and $H_1: \theta > 120$. The two PDFs are then
$$f_0(y) = f_Y(y \mid H_0) = \text{Gaussian}(120, \sigma^2/N),$$
$$f_1(y) = f_Y(y \mid H_1) = \text{Gaussian}(\theta', \sigma^2/N), \quad \theta' > 120.$$


A graph of the two distributions is shown in Figure 9.17. In this figure we plot the 
PDF under the null hypothesis and the PDF under an alternative hypothesis. The decision 
is based on the null, where we marked the critical value. 


[Figure 9.17 plot: the two PDFs $f_0$ and $f_1$, with the annotation "reject $H_0$ if test statistic > this critical value".]

Figure 9.17: The PDF of the estimator under hypotheses $H_0$ and $H_1$. The yellow region defines the
rejection zone $R_\alpha$. If the estimator has a realization $Y = y$ that falls into the rejection zone $R_\alpha$, we
need to reject $H_0$.


Students are frequently confused about the exact equation of the PDF under $H_1$. If
the alternative hypothesis is defined as $\theta > 120$, shall we define the PDF as a Gaussian
centered at 130 or 151.4? They are both valid alternative hypotheses. The answer is that
we are going to express all equations based on $\theta'$. For example, if we want to analyze the
prediction error (this term will be explained later), the prediction error will be a function
of $\theta'$. If $\theta'$ is close to $\theta$, we will expect a larger prediction error. However, if $\theta'$ is far away
from $\theta$, the prediction error may be small.

Whenever we discuss hypothesis testing, a decision rule is always implied. A decision
rule is a mapping $\delta(\cdot)$ from the sample space $\mathcal{Y}$ of the test statistic $Y$ (or $\hat{\Theta}$ if you prefer) to the
binary space $\{0, 1\}$:
$$\delta(y) = \begin{cases} 1, & \text{if } y \in R_\alpha, \quad (\text{we will reject } H_0), \\ 0, & \text{if } y \notin R_\alpha, \quad (\text{we will keep } H_0). \end{cases} \tag{9.27}$$

Here $R_\alpha$ is the rejection zone. For example, in a one-sided test at a critical level $\alpha$, the
rejection zone is $R_\alpha = \{y \mid y \ge \Phi^{-1}(1-\alpha)\}$. Therefore, as long as $y \ge \Phi^{-1}(1-\alpha)$, we will
reject the null hypothesis. Otherwise, we will keep the null hypothesis. A rejection zone can
be one-sided, two-sided, or even more complicated.
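
One way to see that a decision rule is nothing more than an indicator of the rejection zone is to build it from a membership test. The sketch below is an illustration under that view; the helper name make_decision_rule is hypothetical.

```python
from scipy import stats

def make_decision_rule(in_rejection_zone):
    """Turn a membership test for the rejection zone into delta(y) in {0, 1}."""
    def delta(y):
        return 1 if in_rejection_zone(y) else 0   # 1 = reject H0, 0 = keep H0
    return delta

alpha = 0.05

# One-sided zone: R = {y : y >= Phi^{-1}(1 - alpha)}.
one_sided_cutoff = stats.norm.ppf(1 - alpha)
delta_one_sided = make_decision_rule(lambda y: y >= one_sided_cutoff)

# Two-sided zone: R = {y : |y| >= Phi^{-1}(1 - alpha/2)}.
two_sided_cutoff = stats.norm.ppf(1 - alpha / 2)
delta_two_sided = make_decision_rule(lambda y: abs(y) >= two_sided_cutoff)

print(delta_one_sided(2.432), delta_one_sided(1.0))
print(delta_two_sided(-1.92), delta_two_sided(-2.5))
```

Any other rejection zone, however complicated, can be plugged in the same way by changing the membership test.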


Example 9.12. Consider $H_0: \theta = 0.35$ and $H_1: \theta > 0.35$. It was found that the
sample average over 1009 samples is $\hat{\Theta} = 0.387$, with $\sigma^2 = 0.227$. The normalized test
statistic is $\hat{Z} = \sqrt{N}(\hat{\Theta} - \theta)/\sigma = 2.432$. At a 5% critical level, define the decision rule
based on the critical-value approach.

Solution. If $\alpha = 0.05$, it follows that $z_\alpha = \Phi^{-1}(1 - 0.05) = 1.65$. Therefore, the
decision rule is
$$\delta(\hat{z}) = \begin{cases} 1, & \text{if } \hat{z} \ge 1.65, \quad (\text{we will reject } H_0), \\ 0, & \text{if } \hat{z} < 1.65, \quad (\text{we will keep } H_0), \end{cases}$$
where $\hat{z}$ is the realization of $\hat{Z}$. In this particular problem, we have $\hat{z} = 2.432$. Thus,
according to the decision rule, we need to reject $H_0$.


A decision rule is something you create. You do not need to follow the critical-value
or the p-value procedure; you can create your own decision rule. For example, you can
say "reject $H_0$ when $|y| > 0.000001$". There is nothing wrong with this decision rule except
that you will almost always reject the null hypothesis (so it is a bad decision rule). See
Figure 9.18 for a graph of a similar example. If you follow the critical-value or the p-value
procedures, it turns out that the resulting decision rule is equivalent to some form of optimal
decision rule. This concept is the Neyman-Pearson framework, which we will explain shortly.


9.4.2 Type 1 and type 2 errors 


Since hypothesis testing is about applying a decision rule to the test statistics, and since 
no decision rule is perfect, it is natural to ask about the error expected from a particular 
decision rule. In this subsection we define the decision error. However, the terminology varies 
from discipline to discipline. We will explain the decision error first through the statistics 
perspective and then through the signal processing perspective. 

Two tables of the cases that can be generated by a binary decision-making process are
shown in Figure 9.19. The columns of the tables are the true statements, i.e., whether the
test statistic has a population distribution under $H_0$ or $H_1$. The rows of the tables are the
statements predicted by the decision rule, i.e., whether we should declare the statistics are
from $H_0$ or $H_1$. Each combination of the truth and prediction has a label:


• True positive: The truth is $H_1$, and you declare $H_1$.
• True negative: The truth is $H_0$, and you declare $H_0$.
• False positive: The truth is $H_0$, and you declare $H_1$.
• False negative: The truth is $H_1$, and you declare $H_0$.

[Figure 9.18 plot: decision rule 1 is $\delta_1(y) = 1$ if $y \ge 1.96$ and $0$ if $y < 1.96$; decision rule 2 is $\delta_2(y) = 1$ if $y \ge 0.5$ and $0$ if $y < 0.5$.]

Figure 9.18: Two possible decision rules $\delta_1(y)$ and $\delta_2(y)$. In this example, $\delta_1(y)$ is designed according
to the critical-value approach at $\alpha = 0.025$, whereas $\delta_2(y)$ is arbitrarily designed. Both are valid decision
rules, although $\delta_2$ should not be used because it tends to reject the null hypothesis more often than
desired.


Different communities have different ways of labeling these quantities. In the statistics
community the false negative rate (i.e., the number of false negative cases divided by the
number of cases in which $H_1$ is true) is called the type 2 error, and the false positive rate
(the number of false positive cases divided by the number of cases in which $H_0$ is true) is
called the type 1 error. The true positive rate is called the power of the decision rule.

In the engineering community (e.g., radar engineering and signal processing) the
objective is to detect whether a target (e.g., a missile or an enemy aircraft) is present. In this
context, the false positive rate is known as the probability of false alarm, since personnel
will be alerted when no target is present. The false negative rate is known as the probability
of miss because you miss a target. If the truth is $H_1$ and the prediction is also $H_1$, we call
this the probability of detection.


                     Truth: H0          Truth: H1
Prediction H0     True Negative      False Negative
Prediction H1     False Positive     True Positive

                     Truth: H0               Truth: H1
Prediction H0                             Type 2 / Miss
Prediction H1     Type 1 / False Alarm    Power / Detection

Figure 9.19: Terminologies used in labeling the prediction error. The terms "Type 1 error" and "Type
2 error" are commonly used by the statistics community, whereas the terms "false alarm", "miss" and
"detection" are more often used in the engineering community.


The diagram in Figure 9.20 will help to clarify these definitions. Given two hypotheses
$H_0$ and $H_1$, there exist the corresponding distributions $f_0(y)$ and $f_1(y)$, which are the PDFs
of the test statistic $Y$ (or $\hat{\Theta}$ if you prefer). Supposing that our decision rule is to declare
$H_1$ when $Y \ge \eta$ for some $\eta$, for example, $\eta = 1.65$ for a 5% critical level, there are two areas
under the curve that we need to consider.

• Type 1 / False alarm. The blue region under the curve represents the probability of
declaring $H_1$ (i.e., we choose to reject the null) while the truth is actually $H_0$ (i.e., we
should not have rejected the null). Mathematically, this probability is
$$p_F = P[Y \ge \eta \mid H_0] = \int_{y \ge \eta} f_0(y)\, dy. \tag{9.28}$$

• Type 2 / Miss. The pink region under the curve represents the probability of declaring
$H_0$ (i.e., we choose to keep the null) while the truth is actually $H_1$ (i.e., we should
have rejected the null). Mathematically, this probability is
$$p_M = P[Y < \eta \mid H_1] = \int_{y < \eta} f_1(y)\, dy. \tag{9.29}$$


[Figure 9.20 plot: Type 1 Error / False Alarm / False Positive, $p_F \overset{\text{def}}{=} P[Y \ge \eta \mid H_0]$; Type 2 Error / Miss / False Negative, $p_M \overset{\text{def}}{=} P[Y < \eta \mid H_1]$.]

Figure 9.20: Definition of type 1 and type 2 errors.
The power of the decision rule is also known as the detection. It is defined as
$$p_D = P[Y \ge \eta \mid H_1]. \tag{9.30}$$

A plot illustrating the power of the decision rule is shown in Figure 9.21. Since $p_D$ is the
conditional probability of $Y \ge \eta$ given $H_1$, it is the complement of $p_M$, and so we have the
identity
$$p_D = 1 - p_M.$$
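
For Gaussian $f_0$ and $f_1$, the three probabilities reduce to evaluations of the Gaussian CDF. The sketch below is illustrative only: it assumes unit-variance Gaussians, takes $f_0$ centered at 0, and places $f_1$ at a hypothetical mean of 3, since the text leaves the alternative mean unspecified.

```python
from scipy import stats

eta = 1.65    # cutoff for a 5% critical level, as in the text
mu1 = 3.0     # hypothetical mean of f1 (an assumption for illustration)

# Type 1 error / false alarm: area of f0 to the right of the cutoff.
p_F = stats.norm.sf(eta, loc=0.0, scale=1.0)

# Type 2 error / miss: area of f1 to the left of the cutoff.
p_M = stats.norm.cdf(eta, loc=mu1, scale=1.0)

# Power / detection: area of f1 to the right of the cutoff.
p_D = stats.norm.sf(eta, loc=mu1, scale=1.0)

print(f"p_F = {p_F:.4f}, p_M = {p_M:.4f}, p_D = {p_D:.4f}")
print(f"p_D + p_M = {p_D + p_M:.6f}")
```

Moving `mu1` closer to 0 increases `p_M` and decreases `p_D`, which is the dependence on $\theta'$ mentioned earlier.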


Some communities refer to the above quantities in terms of the counts instead of the
probabilities. The difference is that the probabilities are normalized to [0,1] whereas the
counts are just the raw integers obtained from running an experiment. We prefer to use the
probabilities because they are the theoretical values. If you tell us the distributions $f_0$ and
$f_1$, we can report the probabilities. The counts, by contrast, are just another form of sample
statistics. The number of counts today may be different from the number of counts tomorrow
because they are obtained from the experiments. The difference between probabilities and
counts is analogous to the difference between PMFs and histograms.

[Figure 9.21 plot: Power / Detection, $p_D \overset{\text{def}}{=} P[Y \ge \eta \mid H_1]$.]

Figure 9.21: The power of the decision rule is the area under the curve of $f_1$, integrated for $y$ inside
the rejection zone.

Since the probability of errors changes as the decision rule changes, it is necessary to
define $p_F$, $p_D$ and $p_M$ as functions of $\delta$. In addition, hypothesis testing is not limited to
one-sided tests. We can define the rejection zone as $R_\alpha = \{y \mid \text{reject } H_0 \text{ using a critical level } \alpha\}$.
The probabilities $p_F$ and $p_M$ are defined as
$$p_F(\delta) = \int \delta(y) f_0(y)\, dy = \int_{y \in R_\alpha} f_0(y)\, dy, \tag{9.31}$$
$$p_M(\delta) = \int (1 - \delta(y)) f_1(y)\, dy = \int_{y \notin R_\alpha} f_1(y)\, dy. \tag{9.32}$$

Using the property that $p_D = 1 - p_M$, we have that
$$p_D(\delta) = 1 - p_M(\delta) = \int_{y \in R_\alpha} f_1(y)\, dy. \tag{9.33}$$

Note that the rejection zone does not need to depend on $\alpha$. You can arbitrarily define the
rejection zone, and the probabilities $p_F$, $p_M$, and $p_D$ can still be defined.


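Example 9.13. Find $p_F(\delta_1)$ and $p_F(\delta_2)$ for the decision rules in Figure 9.18.

Solution. Since $f_0$ is a Gaussian with zero mean and unit variance, it follows that
$$p_F(\delta_1) = \int_{1.96}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}}\, dy = 1 - \Phi(1.96) = 0.025,$$
$$p_F(\delta_2) = \int_{0.5}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}}\, dy = 1 - \Phi(0.5) = 0.3085.$$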


9.4.3 Neyman-Pearson decision


At this point you have probably observed something about the critical-value test and the
p-value test. Among the four types of decision combinations, we are looking at the false
positive rate, or the probability of false alarm $p_F(\delta)$. The critical-value test requires us to
find $\delta$ such that $p_F(\delta)$ is equal to $\alpha$. That is, if you tell us the critical level $\alpha$ (e.g., $\alpha = 0.05$),
we will find a decision rule (by telling you the cutoff) such that the false alarm rate is $\alpha$.
Consider an example:


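Example 9.14. Let $\alpha = 0.05$. Assume that $f_0$ is a Gaussian with zero mean and
unit variance. Let us do a one-sided test for $H_0: \theta = 0$ versus $H_1: \theta > 0$. Find $\delta$ such
that $p_F(\delta) = \alpha$.

Solution. Let the decision rule $\delta$ be
$$\delta(y) = \begin{cases} 1, & y \ge \eta, \\ 0, & y < \eta. \end{cases}$$

Our goal is to find $\eta$. The probability of false alarm is
$$p_F(\delta) = \int_{\eta}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}}\, dy = 1 - \Phi(\eta).$$
Equating this to $\alpha$, it follows that $1 - \Phi(\eta) = \alpha$ implies $\eta = \Phi^{-1}(1-\alpha) = 1.65$. So the
decision rule becomes
$$\delta(y) = \begin{cases} 1, & y \ge 1.65, \\ 0, & y < 1.65. \end{cases}$$

If you apply this decision rule, you are guaranteed that the false alarm rate is $\alpha = 0.05$.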


But why should we aim for $p_F(\delta)$ equal to $\alpha$? Isn't a lower false alarm rate better?
Indeed, we would not mind having a lower false alarm rate, so we are happy to have any $\delta$
that satisfies $p_F(\delta) \le \alpha$. However, changing the equality to an inequality means that we
now have a set of $\delta$ instead of a unique $\delta$. More important, we need to pay attention to
the trade-off between $p_F(\delta)$ and $p_D(\delta)$. The smaller the $p_F(\delta)$ a decision rule $\delta$ provides,
the smaller the $p_D(\delta)$ you can achieve. This is immediately apparent from Figure 9.20 and
Figure 9.21. (If you move the cutoff to the right, the gray area and the blue area will both
shrink.) Therefore, the desired optimization should be formulated as: From all the decision
rules $\delta$ that have a false alarm rate of no larger than $\alpha$, we pick the one that maximizes the
detection rate. The resulting decision rule is known as the Neyman-Pearson decision rule.


Definition 9.2. The Neyman-Pearson decision rule is defined as the solution to the
optimization
$$\delta^* = \underset{\delta}{\text{argmax}}\ p_D(\delta), \quad \text{subject to } p_F(\delta) \le \alpha.$$


Figure 9.22 illustrates two decision rules $\delta^*(y)$ and $\delta(y)$. The first decision rule $\delta^*(y)$ is
obtained according to the critical-value approach, with $\alpha = 0.025$. As we will prove shortly,
this is also the optimal Neyman-Pearson decision rule for a one-sided hypothesis test at
$\alpha = 0.025$. The second decision rule $\delta(y)$ has a harsher cutoff, meaning that you need a more
extreme test statistic to reject the null hypothesis. Clearly, the false alarm rate of $\delta(y)$ is
less than $\alpha = 0.025$. Thus, $\delta(y)$ is a valid decision rule according to the Neyman-Pearson
formulation. However, $\delta(y)$ is not optimal because the detection rate is not maximized.

[Figure 9.22 plot: $\delta^*(y) = 1$ if $y \ge 1.96$ and $0$ if $y < 1.96$; $\delta(y) = 1$ if $y \ge 2.2$ and $0$ if $y < 2.2$.]

Figure 9.22: Two decision rules $\delta(y)$ and $\delta^*(y)$. Assume that $\alpha = 0.025$. Then $\delta(y)$ is one of the many
feasible choices in the Neyman-Pearson optimization, but $\delta^*(y)$ is the optimal solution.


Because of the complementary behavior of $p_F$ and $p_D$, it follows that $p_D$ is maximized
when $p_F$ hits the upper bound. If we want to maximize the detection rate we need to stretch
the false alarm rate as much as possible. As a result, the Neyman-Pearson solution occurs
when $p_F(\delta) = \alpha$, i.e., when the equality is met.
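
The trade-off can be explored numerically by sweeping the cutoff of a one-sided threshold rule. The sketch below is restricted to threshold rules of the form "reject when $y \ge \eta$" and assumes unit-variance Gaussians with a hypothetical alternative mean of 3; it is an illustration of the argument, not a proof.

```python
import numpy as np
from scipy import stats

alpha = 0.025
mu1 = 3.0                                  # hypothetical mean of f1 (assumption)
etas = np.linspace(0.0, 4.0, 401)          # candidate cutoffs on a 0.01 grid

p_F = stats.norm.sf(etas)                  # false alarm rate for each cutoff
p_D = stats.norm.sf(etas, loc=mu1)         # detection rate for each cutoff

# Keep only the feasible cutoffs (p_F <= alpha) and pick the largest p_D.
feasible = p_F <= alpha
best = int(np.argmax(np.where(feasible, p_D, -np.inf)))
eta_best = float(etas[best])

print(f"best feasible cutoff = {eta_best:.2f}")
print(f"p_F = {p_F[best]:.4f}, p_D = {p_D[best]:.4f}")
```

Because both curves decrease as the cutoff moves right, the best feasible cutoff is the smallest one allowed by the constraint, which is where the false alarm rate (approximately) meets $\alpha$.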

The Neyman-Pearson framework is a general framework for all distributions $f_0$ and $f_1$,
as opposed to the critical-value and p-value examples, which are either Gaussian or Student's
t-distribution. The solution to the Neyman-Pearson optimization is a decision rule known
as the likelihood ratio test. The likelihood ratio is defined as follows.


Definition 9.3. The likelihood ratio for two distributions $f_1(y)$ and $f_0(y)$ is
$$L(y) = \frac{f_1(y)}{f_0(y)}. \tag{9.35}$$

It turns out that the solution to the Neyman-Pearson optimization takes the form of the
likelihood ratio.

Theorem 9.2. The solution to the Neyman-Pearson optimization is a decision rule
that checks the likelihood ratio
$$\delta^*(y) = \begin{cases} 1, & L(y) \ge \eta, \\ 0, & L(y) < \eta, \end{cases} \tag{9.36}$$
for some decision boundary $\eta$ which is a function of the critical level $\alpha$.




What is so special about the Neyman-Pearson decision rule?

• It is the optimal decision. Its optimality is defined w.r.t. maximizing the detection
rate while keeping a reasonable false alarm rate:
$$\delta^* = \underset{\delta}{\text{argmax}}\ p_D(\delta), \quad \text{subject to } p_F(\delta) \le \alpha.$$
If your goal is to maximize the detection rate while maintaining the false alarm
rate, you cannot do better than Neyman-Pearson.

• Its solution is the likelihood ratio test:
$$\delta^*(y) = \begin{cases} 1, & L(y) \ge \eta, \\ 0, & L(y) < \eta, \end{cases}$$
where $L(y) = f_1(y)/f_0(y)$ is the likelihood ratio.

• The critical-value test and the p-value test are special cases of the Neyman-Pearson
test.


Deriving the solution to the Neyman-Pearson optimization can be skipped if this is your 
first time reading the book. 


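Proof. Given $\alpha$, choose $\delta^*$ such that the false alarm rate is maximized: $p_F(\delta^*) = \alpha$. Then,
by substituting the definition of $\delta^*$ into the false alarm rate,
$$\alpha = p_F(\delta^*) = \int \delta^*(y) f_0(y)\, dy = \int_{L(y) \ge \eta} 1 \cdot f_0(y)\, dy + \int_{L(y) < \eta} 0 \cdot f_0(y)\, dy. \tag{9.37}$$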

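Now, consider another decision rule $\delta$ that is not optimal but is feasible. That means that
$\delta$ satisfies $p_F(\delta) \le \alpha$. Therefore,
$$\alpha \ge p_F(\delta) = \int \delta(y) f_0(y)\, dy = \int_{L(y) \ge \eta} \delta(y) f_0(y)\, dy + \int_{L(y) < \eta} \delta(y) f_0(y)\, dy. \tag{9.38}$$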

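Our goal is to show that $p_D(\delta^*) \ge p_D(\delta)$, because by proving this result we can claim that
$\delta^*$ maximizes the detection rate.
By combining Equation (9.37) and Equation (9.38), we have
$$0 \le p_F(\delta^*) - p_F(\delta) = \int_{L(y) \ge \eta} (1 - \delta(y)) f_0(y)\, dy - \int_{L(y) < \eta} \delta(y) f_0(y)\, dy. \tag{9.39}$$
Define $L(y) = f_1(y)/f_0(y)$. Then $L(y) \ge \eta$ if and only if $f_1(y) \ge \eta f_0(y)$. So,
$$p_D(\delta^*) - p_D(\delta) = \int_{L(y) \ge \eta} (1 - \delta(y)) f_1(y)\, dy - \int_{L(y) < \eta} \delta(y) f_1(y)\, dy$$
$$\ge \eta \left[ \int_{L(y) \ge \eta} (1 - \delta(y)) f_0(y)\, dy - \int_{L(y) < \eta} \delta(y) f_0(y)\, dy \right] \ge 0,$$
where the last inequality holds because of Equation (9.39). Therefore, we conclude that $\delta^*$
maximizes $p_D$.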


End of the proof. Please join us again. 


At this point, you may object that the likelihood ratio test (i.e., the Neyman-Pearson
decision rule) is very different from the hypothesis testing examples we have seen in the
previous section because now we need to handle the likelihood ratio $L(y)$. Rest assured
that they are the same, as illustrated by the following example.


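Example 9.15. Consider two hypotheses: $H_0: Y \sim$ Gaussian($0$, $\sigma^2$), and $H_1: Y \sim$
Gaussian($\mu$, $\sigma^2$), with $\mu > 0$. Construct the Neyman-Pearson decision rule (i.e., the
likelihood ratio test).

Solution. Let us first define the likelihood functions. It is clear from the description
that
$$f_0(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{y^2}{2\sigma^2} \right\}, \qquad f_1(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y-\mu)^2}{2\sigma^2} \right\}.$$

Therefore, the likelihood ratio is
$$L(y) = \frac{f_1(y)}{f_0(y)} = \exp\left\{ -\frac{1}{2\sigma^2} \left( \mu^2 - 2\mu y \right) \right\}.$$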


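The likelihood ratio test states that the decision rule is
$$\delta^*(y) = \begin{cases} 1, & L(y) \ge \eta, \\ 0, & L(y) < \eta. \end{cases}$$
So it remains to simplify the condition $L(y) \ge \eta$. To this end, we observe that
$$L(y) \ge \eta \iff -\frac{1}{2\sigma^2}\left( \mu^2 - 2\mu y \right) \ge \log \eta \iff y \ge \underbrace{\frac{\mu}{2} + \frac{\sigma^2}{\mu} \log \eta}_{\overset{\text{def}}{=}\ \tau},$$
where the last step uses $\mu > 0$.

Therefore, instead of determining $\eta$, we just need to determine $\tau$ because the decision rules
based on $\eta$ and $\tau$ are equivalent.

To determine $\tau$, Neyman-Pearson states that $p_F(\delta) \le \alpha$ (and at the optimal point
the equality has to hold). Substituting this criterion into the decision rule,
$$\alpha = p_F(\delta^*) = \int_{y \ge \tau} f_0(y)\, dy = 1 - \Phi\left( \frac{\tau}{\sigma} \right).$$

Taking the inverse of the CDF, we obtain $\tau$:
$$\tau = \sigma \Phi^{-1}(1 - \alpha).$$
Putting everything together, the final decision rule is
$$\delta^*(y) = \begin{cases} 1, & y \ge \sigma \Phi^{-1}(1 - \alpha), \\ 0, & y < \sigma \Phi^{-1}(1 - \alpha). \end{cases}$$

So if $\alpha = 0.05$ we will reject $H_0$ when $y \ge 1.65\sigma$. We can also replace $\sigma$ by $\sigma/\sqrt{N}$ if
the estimator is constructed from multiple measurements.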
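
The equivalence between thresholding the likelihood ratio at $\eta$ and thresholding $y$ at $\tau$ can be checked numerically. The sketch below uses hypothetical values $\mu = 1$ and $\sigma = 2$ (assumptions for illustration) and compares the two decisions on a grid of test statistics.

```python
import math
import numpy as np
from scipy import stats

mu, sigma, alpha = 1.0, 2.0, 0.05          # hypothetical parameters (assumptions)

def likelihood_ratio(y):
    """L(y) = f1(y) / f0(y) for Gaussian(mu, sigma^2) versus Gaussian(0, sigma^2)."""
    return stats.norm.pdf(y, loc=mu, scale=sigma) / stats.norm.pdf(y, loc=0.0, scale=sigma)

# Threshold on y from the derivation, and the matching threshold on L(y).
tau = sigma * stats.norm.ppf(1 - alpha)
eta = likelihood_ratio(tau)

# The two forms of the decision rule should agree everywhere on the grid.
y = np.linspace(-10.0, 10.0, 2001)
lrt_decision = likelihood_ratio(y) >= eta
threshold_decision = y >= tau

# False alarm rate of the rule under H0.
p_F = stats.norm.sf(tau, loc=0.0, scale=sigma)

print(f"tau = {tau:.4f}, eta = {eta:.4f}, p_F = {p_F:.4f}")
print("decisions agree:", bool(np.array_equal(lrt_decision, threshold_decision)))
```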


The above example tells us that even though the likelihood ratio test may appear 
complicated at first glance, the decision is the same as the good old hypothesis testing rules 
we have derived. The flexibility we have gained with the likelihood ratio test is the variety 
of distributions we can handle. Instead of restricting ourselves to Gaussians or Student’s 
t-distribution (which exclusively focuses on the sample averages), the likelihood ratio test 
allows us to consider any distributions. The exact decision rule could be less obvious, but 
the method is generalizable to a broad range of problems. 


Practice Exercise 9.5. In a telephone system, the waiting time is defined as the 
inter-arrival time between two consecutive calls. However, it is known that sometimes 
the waiting time can be mistakenly recorded as the time between three consecutive 
calls (i.e., by skipping the second one). Since the interarrival time of an independent 
Poisson process is either an exponential random variable or an Erlang random variable, 
depending on how many occurrences we are counting, we define the hypotheses 


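$$f_0(y) = \begin{cases} e^{-y}, & y \ge 0, \\ 0, & y < 0, \end{cases} \quad\text{and}\quad f_1(y) = \begin{cases} y e^{-y}, & y \ge 0, \\ 0, & y < 0. \end{cases}$$

Suppose we are given one measurement $Y = y$. Find the Neyman-Pearson decision
rule for $\alpha = 0.05$.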
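Solution. The likelihood ratio is

$$L(y) = \frac{f_1(y)}{f_0(y)} = \frac{y e^{-y}}{e^{-y}} = y, \qquad y \ge 0.$$

Substituting this into the decision rule, we have

$$L(y) \ge \eta \iff y \ge \eta, \qquad L(y) < \eta \iff y < \eta.$$

The false alarm constraint requires $\alpha = \int_{\eta}^{\infty} e^{-y}\,dy = e^{-\eta}$, so $\eta = -\log\alpha$. Hence

$$L(y) \ge \eta \iff y \ge -\log\alpha, \qquad L(y) < \eta \iff y < -\log\alpha.$$

For $\alpha = 0.05$, we reject the null hypothesis when $y \ge 2.9957$. Figure 9.23 illustrates
the hypothesis testing rule.

$$\delta^*(y) = \begin{cases} 1, & y \ge 2.9957, \\ 0, & y < 2.9957. \end{cases}$$

Figure 9.23: Neyman-Pearson decision rule at $\alpha = 0.05$.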


Remark. This example is instructive in that we have only one measurement $Y = y$.
If we have repeated measurements and take the average, then the Central Limit Theorem
will kick in. In that case, we can resort to our favorite Gaussian distribution
or Student's t-distribution instead of dealing with the exponential and the Erlang
distributions. However, the example demonstrates the usefulness of Neyman-Pearson,
especially when the distributions are complicated.
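The decision rule of this exercise can be checked with a few lines of Python. The sketch below assumes the unit-rate densities $f_0(y) = e^{-y}$ and $f_1(y) = y e^{-y}$ for $y \ge 0$. The closed-form detection rate $p_D = (1+\tau)e^{-\tau}$ is obtained by integrating $f_1$ over $[\tau, \infty)$; it is not stated in the exercise, and the sample sizes are arbitrary.

```python
import numpy as np

alpha = 0.05
tau = -np.log(alpha)                  # threshold solving exp(-tau) = alpha

# Closed-form rates for the rule "reject H0 when y >= tau".
pF = np.exp(-tau)                     # integral of f0 over [tau, inf)
pD = (1 + tau) * np.exp(-tau)         # integral of f1 over [tau, inf)

# Monte Carlo check with samples drawn from each hypothesis.
rng = np.random.default_rng(0)
y0 = rng.exponential(scale=1.0, size=200_000)         # exponential, rate 1
y1 = rng.gamma(shape=2.0, scale=1.0, size=200_000)    # Erlang with 2 stages, rate 1
pF_hat = np.mean(y0 >= tau)
pD_hat = np.mean(y1 >= tau)

print(f"tau = {tau:.4f}, pF = {pF:.4f}, pD = {pD:.4f}")
print(f"simulated pF = {pF_hat:.4f}, simulated pD = {pD_hat:.4f}")
```

With a single measurement the detection rate at $\alpha = 0.05$ is modest, which is consistent with the heavy overlap between the exponential and Erlang densities.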
9.5 ROC and Precision-Recall Curve


Being a binary decision rule, the hypothesis testing procedure shares many similarities with
a two-class classification algorithm.³ Given a testing statistic or a testing sample, both
the hypothesis testing and a classification algorithm will report YES or NO. Therefore, 
any performance evaluation metric developed for hypothesis testing is equally applicable to 
classification and vice versa. 

The topic we study in this section is the receiver operating characteristic (ROC) curve 
and the precision-recall (PR) curve. The ROC curve and the PR curve are arguably the 
most popular metrics in modern machine learning, in particular for classification, detection, 
and segmentation tasks in computer vision. There are many unresolved questions about 
these two curves and there are many debates about how to use them. Our goal is not to add 
another voice to the debate; rather, we would like to fill in the gap between the hypothesis 
testing theory (particularly the Neyman-Pearson framework) and these two sets of curves. 
We will establish the equivalence between the two curves and leave the open-ended debates 
to you. 


9.5.1 Receiver Operating Characteristic (ROC) 


Our approach to understanding the ROC curve and the PR curve is based on the Neyman-
Pearson framework. Under this framework, we know that the optimal decision rule with respect to
the Neyman-Pearson criterion is the solution to the optimization

$$\delta^*(\alpha) = \underset{\delta}{\text{argmax}} \;\; p_D(\delta) \quad \text{subject to} \quad p_F(\delta) \le \alpha.$$


As a result of this optimization, the decision rule $\delta^*$ will achieve a certain false alarm rate
$p_F(\delta^*)$ and detection rate $p_D(\delta^*)$. Clearly, the decision rule $\delta^*$ changes as we change the
critical level $\alpha$. Accordingly we write $\delta^*$ as $\delta^*(\alpha)$ to reflect this dependency.

What this observation implies is that as we sweep through the range of $\alpha$'s, we construct
different decision rules, each one with a different $p_F$ and $p_D$. If we denote the decision rules
by $\delta_1, \delta_2, \ldots, \delta_M$, we have $M$ pairs of false alarm rate $p_F$ and detection rate $p_D$:

• Decision rule $\delta_1$: False alarm rate $p_F(\delta_1)$ and detection rate $p_D(\delta_1)$.

• Decision rule $\delta_2$: False alarm rate $p_F(\delta_2)$ and detection rate $p_D(\delta_2)$.

⋮

• Decision rule $\delta_M$: False alarm rate $p_F(\delta_M)$ and detection rate $p_D(\delta_M)$.


³ In a classification algorithm, the goal is to look at the testing sample $y$ and compute certain thresholding
criteria. For example, a typical decision rule of a classification algorithm is
$\delta(y) = \begin{cases} 1, & w^T\phi(y) \ge \tau, \\ 0, & w^T\phi(y) < \tau. \end{cases}$ Here,
you can think of the vector $w$ as the regression coefficient, and $\phi(\cdot)$ is some kind of feature transform. The
equation says that class 1 will be reported if the inner product is larger than a threshold $\tau$, and class 0
will be reported otherwise. Therefore, a binary classification, when written in this form, is the same as a
hypothesis testing procedure.
If we plot $p_D(\delta)$ on the y-axis as a function of $p_F(\delta)$ on the x-axis, we obtain a curve shown
in Figure 9.24 (see the example below for the problem setting). The black curve shown on
the right is known as the receiver operating characteristic (ROC) curve.


[Figure 9.24 has two panels. Left: $p_F(\alpha)$ and $p_D(\alpha)$ plotted against $\alpha$. Right: the ROC curve, with $p_D$ on the y-axis and $p_F$ on the x-axis.]

Figure 9.24: An example of an ROC curve, where we consider two hypotheses: $H_0 : Y \sim \text{Gaussian}(0, 2)$,
and $H_1 : Y \sim \text{Gaussian}(3, 2)$. We construct the Neyman-Pearson decision rule for a range of critical
levels $\alpha$. For each $\alpha$ we compute the theoretical $p_F(\alpha)$ and $p_D(\alpha)$, shown on the left-hand side of the
figure. The pair of $(p_D, p_F)$ is then plotted as the right-hand side curve by sweeping the $\alpha$'s.


The setup of the figure follows the example below. 


Example 9.16. We consider two hypotheses: $H_0 : Y \sim \text{Gaussian}(0, 2)$, and $H_1 : Y \sim
\text{Gaussian}(3, 2)$. Derive the Neyman-Pearson decision rule and plot the ROC curve.


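Solution. We construct a Neyman-Pearson decision rule:

$$\delta^*(y) = \begin{cases} 1, & y \ge \sigma\Phi^{-1}(1-\alpha), \\ 0, & y < \sigma\Phi^{-1}(1-\alpha), \end{cases}$$

where the threshold $\tau = \sigma\Phi^{-1}(1-\alpha)$ is tunable through $\alpha$, and $\sigma = 2$ in this example. For example, if $\alpha = 0.05$, then $\sigma\Phi^{-1}(1-0.05) = 3.2897$,
and if $\alpha = 0.1$, then $\sigma\Phi^{-1}(1-0.1) = 2.5631$. Therefore, the false alarm rate and the
detection rate are functions of the critical level $\alpha$.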

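For this particular example, we have the false alarm rate and detection rate in
closed form, as functions of $\alpha$:

$$p_F(\alpha) = \int_{\sigma\Phi^{-1}(1-\alpha)}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{y^2}{2\sigma^2}}\,dy = 1 - \Phi\left(\Phi^{-1}(1-\alpha)\right) = \alpha,$$

$$p_D(\alpha) = \int_{\sigma\Phi^{-1}(1-\alpha)}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}}\,dy = 1 - \Phi\left(\Phi^{-1}(1-\alpha) - \frac{\mu}{\sigma}\right).$$

These give us the two curves on the left-hand side of Figure 9.24.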
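To make the closed-form expressions concrete, the short Python sketch below evaluates the threshold, $p_F$, and $p_D$ at the two critical levels quoted in the example. It assumes $\sigma = 2$ and $\mu = 3$, the values used in the plotting code of this section; the function name `np_operating_point` is a hypothetical helper introduced only for this illustration.

```python
from scipy import stats

sigma, mu = 2.0, 3.0   # assumed setting: standard deviation 2, mean shift 3

def np_operating_point(alpha):
    """Return (tau, pF, pD) of the Gaussian Neyman-Pearson rule at critical level alpha."""
    tau = sigma * stats.norm.ppf(1 - alpha)          # tau = sigma * Phi^{-1}(1 - alpha)
    pF = 1 - stats.norm.cdf(tau / sigma)             # mass of f0 above tau
    pD = 1 - stats.norm.cdf((tau - mu) / sigma)      # mass of f1 above tau
    return tau, pF, pD

for alpha in (0.05, 0.1):
    tau, pF, pD = np_operating_point(alpha)
    print(f"alpha = {alpha:.2f}: tau = {tau:.4f}, pF = {pF:.4f}, pD = {pD:.4f}")
```

Each value of $\alpha$ yields one point $(p_F, p_D)$ on the ROC curve; sweeping $\alpha$ over $[0, 1]$ traces the whole curve.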


What is an ROC curve? 


• It is a plot showing $p_D$ on the y-axis and $p_F$ on the x-axis.

• $p_D$ = detection rate (also known as the power of the test).

• $p_F$ = false alarm rate (also known as the Type 1 error of the test).

The ROC curve tells us the behavior of the decision rule as we change the threshold $\alpha$.
A graphical illustration is shown in Figure 9.25. There are a few key observations we need
to pay attention to:


[Figure 9.25 shows an ROC curve ($p_D$ versus $p_F$) with three annotations. At the top right: 100% detection happens when you always claim $H_1$, but if you always claim $H_1$ your false alarm is also 100%. At the bottom left: 0% false alarm happens when you always claim $H_0$, but you will never detect anything. On the curve: the line you can best possibly do; you cannot go beyond this line.]

Figure 9.25: Interpreting the ROC curve.


• The ROC curve must go through (0,0). This happens when you always keep the
null hypothesis or always declare class 0, no matter what observations you have. If you always
keep $H_0$, certainly you will not make any false positive (or false alarm), because you
will never say $H_0$ is wrong. However, you will never detect anything either, so the detection
rate (or the power of the test) is also 0. This is a useless decision rule for both classification
and hypothesis testing.

• The ROC curve must go through (1,1). This happens when you always reject the null
hypothesis, no matter what observations you have. If you always reject $H_0$, you will
always say that "there is a target". As far as detection is concerned, you are perfect
because you have not missed any targets. However, the false positive rate is also 1
because you will falsely declare a target whenever there is nothing. Therefore, this is also
a useless decision rule.

• The ROC curve tells us the operating point of the decision rule as we change the threshold.
A threshold is a universal concept for both hypothesis testing and classification.
In hypothesis testing, we have the critical level $\alpha$, say 0.05 or 0.1. In classification, we
also have a threshold for judging whether a sample should be classified as class 1 or
class 0. Often in classification, the intermediate estimates are probabilities or distances
to decision boundaries. These real numbers need to be binarized to generate a binary
decision. The ROC curve tells us that if you pick a threshold, your decision rule will
have a certain $p_F$ and $p_D$ as predicted by the curve. If you want to tolerate a higher
$p_F$, you can move along the curve to find your operating point.

• The ideal operating point on an ROC curve is when $p_F = 0$ and $p_D = 1$. However, this
is a hypothetical situation that does not happen in any real decision rule.


9.5.2 Comparing ROC curves 


Because of how the ROC curves are constructed, every binary decision rule has its own ROC 
curve. Typically, when one tries to compare classification algorithms, the area under the 
curve (AUC) occupied by the ROC curve is compared. A decision rule having a larger AUC 
is often a “better” decision rule. 

To illustrate the idea of comparing decision rules, we consider a trivial decision rule based
on a blind guess.


Example 9.17. (A blind guess decision) Consider a decision rule that we reject $H_0$
with probability $\alpha$ and keep $H_0$ with probability $1-\alpha$. We call this a blind guess, since
the decision rule ignores observation $y$. Mathematically, this trivial decision rule is

$$\delta(y) = \begin{cases} 1, & \text{with probability } \alpha, \\ 0, & \text{with probability } 1-\alpha. \end{cases}$$

Find $p_F$, $p_D$, and AUC.


Solution. For this decision rule we compute its false positive rate (or false alarm rate)
and its true positive rate (or detection rate). However, since $\delta(y)$ is now random, we
need to take the expectation over the two random states that $\delta(y)$ can take. This gives
us

$$p_F(\alpha) = \mathbb{E}\left[\int \delta(y) f_0(y)\,dy\right] = \int 1 \cdot f_0(y)\,dy\;\mathbb{P}[\delta(y) = 1] + \int 0 \cdot f_0(y)\,dy\;\mathbb{P}[\delta(y) = 0] = \alpha \int f_0(y)\,dy = \alpha.$$

Similarly, the detection rate is

$$p_D(\alpha) = \mathbb{E}\left[\int \delta(y) f_1(y)\,dy\right] = \alpha \int f_1(y)\,dy = \alpha.$$

If we plot $p_D$ as a function of $p_F$, we notice that the function is a straight line going
from (0,0) to (1,1). This decision rule is useless. Comparing this with the Neyman-
Pearson decision rule, it is clear that Neyman-Pearson has a larger AUC. The AUC
for this trivial decision rule is the area of the triangle, which is 0.5.

[Figure 9.26 shows two ROC curves ($p_D$ versus $p_F$): the Neyman-Pearson decision rule and the blind guess.]

Figure 9.26: The ROC curve of the blind guess decision rule is a straight line. The AUC is 0.5.


If you set $\alpha = 0.5$, then the decision rule becomes

$$\delta(y) = \begin{cases} 1, & \text{with probability } \tfrac{1}{2}, \\ 0, & \text{with probability } \tfrac{1}{2}. \end{cases}$$

This is equivalent to flipping a fair coin with probability 1/2 of declaring $H_0$ and 1/2
declaring $H_1$. Its operating point is the yellow circle.
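The result $p_F(\alpha) = p_D(\alpha) = \alpha$ can also be seen by simulation. The following is a minimal Python sketch, not part of the example: the function name `blind_guess_rates`, the number of trials, and the random seed are our own assumptions. Because the blind guess never looks at $y$, the simulation only needs to draw the random decisions; the distribution of $Y$ under either hypothesis plays no role.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000   # number of simulated decisions per hypothesis (arbitrary choice)

def blind_guess_rates(alpha):
    """Estimate (pF, pD) for the rule that rejects H0 with probability alpha."""
    decisions_under_h0 = rng.random(n) < alpha   # decisions made while H0 is true
    decisions_under_h1 = rng.random(n) < alpha   # decisions made while H1 is true
    pF = decisions_under_h0.mean()               # fraction of false alarms
    pD = decisions_under_h1.mean()               # fraction of detections
    return pF, pD

for alpha in (0.1, 0.5, 0.9):
    pF, pD = blind_guess_rates(alpha)
    print(f"alpha = {alpha:.1f}: pF = {pF:.3f}, pD = {pD:.3f}")
```

Both estimates track $\alpha$, so the simulated operating points fall on the diagonal from (0,0) to (1,1).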


Computing the AUC can be done by calling special library functions. However, to
spell out the details we demonstrate something more elementary. The program below is a
piece of MATLAB code plotting two ROC curves corresponding to two different decision
rules. The first decision rule is the trivial decision rule, where we have just shown that
$p_F(\alpha) = p_D(\alpha) = \alpha$. The second decision rule is the Neyman-Pearson decision rule, for
which we showed in Figure 9.24 that $p_F(\alpha) = \alpha$ and $p_D(\alpha) = 1 - \Phi\left(\Phi^{-1}(1-\alpha) - \frac{\mu}{\sigma}\right)$.
Using the MATLAB code below, we can plot the two ROC curves shown in Figure 9.26.

% MATLAB code to plot ROC curve
sigma = 2; mu = 3;
alphaset = linspace(0,1,1000);
PF1 = zeros(1,1000); PD1 = zeros(1,1000);
PF2 = zeros(1,1000); PD2 = zeros(1,1000);
for i=1:1000
    alpha = alphaset(i);
    % Blind guess: pF = pD = alpha
    PF1(i) = alpha;
    PD1(i) = alpha;
    % Neyman-Pearson: pF = alpha, pD = 1 - Phi(Phi^{-1}(1-alpha) - mu/sigma)
    PF2(i) = alpha;
    PD2(i) = 1-normcdf(norminv(1-alpha)-mu/sigma);
end
figure;
plot(PF1, PD1,'LineWidth', 4, 'Color', [0.8, 0, 0]); hold on;
plot(PF2, PD2,'LineWidth', 4, 'Color', [0, 0, 0]); hold off;


To compute the AUC we perform a numerical integration:

$$\text{AUC} = \int p_D(\alpha)\,dp_F(\alpha) \approx \sum_{i} p_D(\alpha_i)\,\Delta p_F(\alpha_i) = \sum_{i} p_D(\alpha_i)\left[p_F(\alpha_i) - p_F(\alpha_{i-1})\right],$$

where $\alpha_i$ is the $i$th critical level we use to plot the ROC curve. (We assume that the $\alpha$'s are
sorted in ascending order.) In MATLAB, the commands are

auc1 = sum(PD1.*[0 diff(PF1)])
auc2 = sum(PD2.*[0 diff(PF2)])


The AUCs computed by MATLAB are 0.5005 for the blind guess (the first curve) and 0.8561 for the
Neyman-Pearson decision rule (the second curve). The small slack of 0.0005 is caused by the numerical
approximation at the tail, which can be ignored as long as you are consistent for all the ROC curves.

The commands for Python are analogous to the commands for MATLAB. 


# Python code to plot ROC curve
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

sigma = 2; mu = 3;
alphaset = np.linspace(0,1,1000)
PF1 = np.zeros(1000); PD1 = np.zeros(1000)
PF2 = np.zeros(1000); PD2 = np.zeros(1000)
for i in range(1000):
    alpha = alphaset[i]
    # Blind guess: pF = pD = alpha
    PF1[i] = alpha
    PD1[i] = alpha
    # Neyman-Pearson: pF = alpha, pD = 1 - Phi(Phi^{-1}(1-alpha) - mu/sigma)
    PF2[i] = alpha
    PD2[i] = 1-stats.norm.cdf(stats.norm.ppf(1-alpha)-mu/sigma)
plt.plot(PF1,PD1)
plt.plot(PF2,PD2)


To compute the AUC, the Python code is (continuing from the previous code): 
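A minimal sketch of that computation follows. It recomputes the two ROC curves so that it runs on its own, and it mirrors the MATLAB commands `sum(PD.*[0 diff(PF)])`. The names `auc1` and `auc2` follow the MATLAB listing; using `np.append` and `np.diff` for the one-sided sum is our choice of implementation.

```python
import numpy as np
import scipy.stats as stats

sigma = 2; mu = 3
alphaset = np.linspace(0, 1, 1000)

# Blind guess: pF = pD = alpha
PF1 = alphaset.copy()
PD1 = alphaset.copy()

# Neyman-Pearson: pF = alpha, pD = 1 - Phi(Phi^{-1}(1-alpha) - mu/sigma)
PF2 = alphaset.copy()
PD2 = 1 - stats.norm.cdf(stats.norm.ppf(1 - alphaset) - mu / sigma)

# One-sided Riemann sum: sum_i pD(alpha_i) * [pF(alpha_i) - pF(alpha_{i-1})]
auc1 = np.sum(PD1 * np.append(0, np.diff(PF1)))
auc2 = np.sum(PD2 * np.append(0, np.diff(PF2)))
print(auc1, auc2)
```

Because the sum uses the right endpoint of each interval, it slightly overestimates the area under an increasing curve; this is the small slack discussed for the MATLAB result.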
It is possible to get a decision rule that is worse than a blind guess. The following
example illustrates a trivial setup.

Practice Exercise 9.6. (Flipped Neyman-Pearson). Consider two hypotheses

$$H_0 = \text{Gaussian}(0, \sigma^2), \qquad H_1 = \text{Gaussian}(\mu, \sigma^2), \quad \mu > 0.$$

Let $\alpha$ be the critical level. The Neyman-Pearson decision rule is

$$\delta^*(y) = \begin{cases} 1, & y \ge \sigma\Phi^{-1}(1-\alpha), \\ 0, & y < \sigma\Phi^{-1}(1-\alpha). \end{cases}$$

Now, consider a flipped Neyman-Pearson decision rule

$$\delta^+(y) = \begin{cases} 1, & y < \sigma\Phi^{-1}(1-\alpha), \\ 0, & y \ge \sigma\Phi^{-1}(1-\alpha). \end{cases}$$

Find $p_F$, $p_D$, and AUC for the new decision rule $\delta^+$.


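Solution. Since we flip the rejection zone, the probability of false alarm is

$$p_F(\alpha) = \int \delta^+(y) f_0(y)\,dy = \int_{-\infty}^{\sigma\Phi^{-1}(1-\alpha)} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{y^2}{2\sigma^2}}\,dy = 1 - \alpha.$$

Similarly, the probability of detection is

$$p_D(\alpha) = \int \delta^+(y) f_1(y)\,dy = \int_{-\infty}^{\sigma\Phi^{-1}(1-\alpha)} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}}\,dy = \Phi\left(\Phi^{-1}(1-\alpha) - \frac{\mu}{\sigma}\right).$$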
If you plot $p_D$ as a function of $p_F$, you will obtain a curve shown in Figure 9.27.
The AUC for this flipped decision rule is 0.1439, whereas that for Neyman-Pearson is
0.8561. The two numbers are complements of each other, meaning that their sum is
unity.

[Figure 9.27 shows two ROC curves ($p_D$ versus $p_F$): the Neyman-Pearson decision rule and the flipped Neyman-Pearson decision rule.]

Figure 9.27: The ROC curve of a flipped Neyman-Pearson decision rule.
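The complementarity of the two AUC values can be verified numerically. The sketch below is an illustration under stated assumptions: it uses $\sigma = 2$ and $\mu = 3$ as in the plotting code of this section, takes the flipped rule to reject $H_0$ when $y < \sigma\Phi^{-1}(1-\alpha)$, and integrates with the trapezoidal rule rather than the one-sided sum of the MATLAB commands, so the printed values can differ from 0.1439 and 0.8561 in the last digits. The helper `auc_trapezoid` is a hypothetical name.

```python
import numpy as np
from scipy import stats

sigma, mu = 2.0, 3.0
alpha = np.linspace(0, 1, 1000)
z = stats.norm.ppf(1 - alpha)          # Phi^{-1}(1 - alpha)

# Neyman-Pearson rule: reject H0 when y >= sigma * z
PF_np = alpha
PD_np = 1 - stats.norm.cdf(z - mu / sigma)

# Flipped rule: reject H0 when y < sigma * z
PF_flip = 1 - alpha
PD_flip = stats.norm.cdf(z - mu / sigma)

def auc_trapezoid(PF, PD):
    """Area under the curve PD versus PF; the points are sorted by PF first."""
    order = np.argsort(PF)
    PF, PD = PF[order], PD[order]
    return np.sum(0.5 * (PD[1:] + PD[:-1]) * np.diff(PF))

auc_np = auc_trapezoid(PF_np, PD_np)
auc_flip = auc_trapezoid(PF_flip, PD_flip)
print(f"Neyman-Pearson AUC = {auc_np:.4f}, flipped AUC = {auc_flip:.4f}")
```

Sorting by $p_F$ matters here because sweeping $\alpha$ upward moves the flipped rule from right to left along the $p_F$ axis.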


What if we arbitrarily construct a decision rule that is neither Neyman-Pearson nor 
the blind guess? The following example demonstrates one possible choice. 


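Practice Exercise 9.7. Consider two hypotheses

$$H_0 = \text{Gaussian}(0, \sigma^2), \qquad H_1 = \text{Gaussian}(\mu, \sigma^2), \quad \mu > 0.$$

Let $\alpha$ be the critical level. Consider the following decision rule:

$$\delta^{\sharp}(y) = \begin{cases} 1, & |y| \ge \sigma\Phi^{-1}(1-\alpha), \\ 0, & |y| < \sigma\Phi^{-1}(1-\alpha). \end{cases}$$

Find $p_F$, $p_D$, and AUC for the new decision rule $\delta^{\sharp}$.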


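Solution. The probability of false alarm is

$$p_F(\alpha) = \int \delta^{\sharp}(y) f_0(y)\,dy = 1 - \int_{-\sigma\Phi^{-1}(1-\alpha)}^{\sigma\Phi^{-1}(1-\alpha)} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{y^2}{2\sigma^2}}\,dy = 1 - \Phi\left(\Phi^{-1}(1-\alpha)\right) + \Phi\left(-\Phi^{-1}(1-\alpha)\right) = 2\alpha.$$

Similarly, the probability of detection is

$$p_D(\alpha) = \int \delta^{\sharp}(y) f_1(y)\,dy = 1 - \int_{-\sigma\Phi^{-1}(1-\alpha)}^{\sigma\Phi^{-1}(1-\alpha)} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}}\,dy$$

$$= 1 - \Phi\left(\frac{\sigma\Phi^{-1}(1-\alpha) - \mu}{\sigma}\right) + \Phi\left(\frac{-\sigma\Phi^{-1}(1-\alpha) - \mu}{\sigma}\right)$$

$$= 1 - \Phi\left(\Phi^{-1}(1-\alpha) - \frac{\mu}{\sigma}\right) + \Phi\left(-\Phi^{-1}(1-\alpha) - \frac{\mu}{\sigma}\right).$$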


If you plot $p_D$ as a function of $p_F$, you will obtain a curve shown in Figure 9.28.
The AUC for this proposed decision rule is 0.7534, whereas that of Neyman-Pearson
is 0.8561. Therefore, the Neyman-Pearson decision rule is better.

[Figure 9.28 shows two ROC curves ($p_D$ versus $p_F$): the Neyman-Pearson decision rule and the proposed decision rule.]

Figure 9.28: The ROC curve of a proposed decision rule.


The MATLAB code we used to generate Figure 9.28 is shown below. Note that we
need to separate the calculations of the two curves, because the proposed curve has
$p_F = 2\alpha$ and so can only take $0 \le \alpha \le 0.5$. The Python code is implemented analogously.


% MATLAB code to generate the ROC curve.
sigma = 2; mu = 3;
PF1 = zeros(1,1000); PD1 = zeros(1,1000);
PF2 = zeros(1,1000); PD2 = zeros(1,1000);
% Proposed two-sided rule: alpha ranges over [0, 0.5] because pF = 2*alpha
alphaset = linspace(0,0.5,1000);
for i=1:1000
    alpha = alphaset(i);
    PF1(i) = 2*alpha;
    PD1(i) = 1-(normcdf(norminv(1-alpha)-mu/sigma)-...
                normcdf(-norminv(1-alpha)-mu/sigma));
end
% Neyman-Pearson rule: alpha ranges over [0, 1]
alphaset = linspace(0,1,1000);
for i=1:1000
    alpha = alphaset(i);
    PF2(i) = alpha;
    PD2(i) = 1-normcdf(norminv(1-alpha)-mu/sigma);
end
figure;
plot(PF1, PD1,'LineWidth', 4, 'Color', [0.8, 0, 0]); hold on;
plot(PF2, PD2,'LineWidth', 4, 'Color', [0, 0, 0]); hold off;


# Python code to generate the ROC curve
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

sigma = 2; mu = 3;
PF1 = np.zeros(1000); PD1 = np.zeros(1000)
PF2 = np.zeros(1000); PD2 = np.zeros(1000)

# Proposed two-sided rule: alpha ranges over [0, 0.5] because pF = 2*alpha
alphaset = np.linspace(0,0.5,1000)
for i in range(1000):
    alpha = alphaset[i]
    PF1[i] = 2*alpha
    PD1[i] = 1-(stats.norm.cdf(stats.norm.ppf(1-alpha)-mu/sigma) \
                -stats.norm.cdf(-stats.norm.ppf(1-alpha)-mu/sigma))

# Neyman-Pearson rule: alpha ranges over [0, 1]
alphaset = np.linspace(0,1,1000)
for i in range(1000):
    alpha = alphaset[i]
    PF2[i] = alpha
    PD2[i] = 1-stats.norm.cdf(stats.norm.ppf(1-alpha)-mu/sigma)

plt.plot(PF1, PD1)
plt.plot(PF2, PD2)


9.5.3. The ROC curve in practice 


If the Neyman-Pearson decision rule is the optimal rule, why don’t we always use it? The 
problem is that in practice we may not have access to the distributions. For example, if we 
classify images, how do we know that the data follows a Gaussian distribution or a mixture 
of distributions? Consequently, the ROC curves we discussed in the subsections above are 
the theoretical ROC curves. In practice, we plot the empirical ROC curves. 

Plotting an empirical ROC curve for a binary classification method (and hypothesis 
testing) is intuitive. The ingredients we need are a set of scores and a set of labels. 
The scores are the probability values determining the likelihood of a sample belonging to 
one class. Generally speaking, for empirical data this requires looking at the training data, 
building a model, and computing the likelihood. We will not go into the details of how a 
binary classifier is built. Instead, we assume that you have already built a binary classifier 
and have obtained the scores. Our goal is to show you how to plot the ROC curve. 
The following MATLAB code uses a dataset fisheriris. The code builds a binary
classifier and returns the scores.


% MATLAB code to train a classification algorithm.
% Do not worry if you cannot understand this code.
% It is not the focus of this book.
load fisheriris
pred = meas(51:end,1:2);
resp = (1:100)'>50;
mdl = fitglm(pred,resp,'Distribution','binomial','Link','logit');
scores = mdl.Fitted.Probability;
labels = [ones(1,50), zeros(1,50)];
save('ch9_ROC_example_data','scores','labels');


To give you an idea of how the scores of the classifier look, we plot the histogram of
the scores in Figure 9.29. As you can see, there is no clear division between the two classes.
No matter what threshold $\tau$ we use, some cases will be misclassified.

Figure 9.29: The distribution of probability scores obtained from a binary classifier for the dataset
fisheriris. The green vertical lines represent the threshold for turning the scores into binary decisions.
Any score greater than $\tau$ will be classified as Class 1, and any score that is less than $\tau$ will be classified
as Class 0. These predicted labels would then be compared to the true labels to plot the ROC curve.


Recall that the ROC curve is a function of $p_D$ versus $p_F$. Using terminology from
statistics, $p_D$ is the true positive rate and $p_F$ is the false positive rate. By sweeping a range
of decision thresholds (over the scores), we can compute the corresponding $p_F$'s and $p_D$'s.
On a computer this can be done by setting up two columns of labels: the true labels labels
and the predicted labels predict. For any threshold $\tau$, we binarize the scores to turn
them into a decision vector. Then we count the number of true positives, true negatives,
false positives, and false negatives. Dividing the number of false positives by the number of
actual negatives gives $p_F$, and dividing the number of true positives by the number of actual
positives gives $p_D$.

In MATLAB, the above description can be easily implemented by sweeping through
the range of $\tau$.


% MATLAB code to generate an empirical ROC curve
load ch9_ROC_example_data
tau = linspace(0,1,1000);
for i=1:1000
    % labels = 1 marks the first 50 samples, whose fitted scores tend to be low,
    % so a sample is declared positive when its score is at most tau(i)
    idx = (scores <= tau(i));
    predict = zeros(1,100);
    predict(idx) = 1;
    true_positive = 0; true_negative = 0;
    false_positive = 0; false_negative = 0;
    for j=1:100
        if (predict(j)==1) && (labels(j)==1)
            true_positive = true_positive + 1; end
        if (predict(j)==1) && (labels(j)==0)
            false_positive = false_positive + 1; end
        if (predict(j)==0) && (labels(j)==1)
            false_negative = false_negative + 1; end
        if (predict(j)==0) && (labels(j)==0)
            true_negative = true_negative + 1; end
    end
    PF(i) = false_positive/50;   % 50 actual negatives
    PD(i) = true_positive/50;    % 50 actual positives
end
plot(PF, PD, 'LineWidth', 4, 'Color', [0, 0, 0]);


The Python codes of this problem are similar. We give them here for completeness.

# Python code to generate an empirical ROC curve
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
scores = np.loadtxt('ch9_ROC_example_data.txt')
labels = np.append(np.ones(50), np.zeros(50))
tau = np.linspace(0,1,1000)
PF = np.zeros(1000)
PD = np.zeros(1000)
for i in range(1000):
    # labels = 1 marks the first 50 samples, whose fitted scores tend to be low,
    # so a sample is declared positive when its score is at most tau[i]
    idx = scores <= tau[i]
    predict = np.zeros(100)
    predict[idx] = 1
    true_positive = 0; true_negative = 0
    false_positive = 0; false_negative = 0
    for j in range(100):
        if (predict[j]==1) and (labels[j]==1): true_positive += 1
        if (predict[j]==1) and (labels[j]==0): false_positive += 1
        if (predict[j]==0) and (labels[j]==1): false_negative += 1
        if (predict[j]==0) and (labels[j]==0): true_negative += 1
    PF[i] = false_positive/50   # 50 actual negatives
    PD[i] = true_positive/50    # 50 actual positives
plt.plot(PF, PD)
The empirical ROC curve for this problem is shown in Figure 9.30. Each point on the
curve is a coordinate $(p_F, p_D)$, evaluated at a particular threshold $\tau$. Mathematically, the
decision rule we used was

$$\delta(y) = \begin{cases} 1, & \text{score}(y) \ge \tau, \\ 0, & \text{score}(y) < \tau. \end{cases}$$

For every $\tau$, we have a false alarm rate and a detection rate. Since this is an empirical
dataset with only 100 samples, there are many occasions where $p_F$ does not change but
$p_D$ increases, or $p_D$ stays constant but $p_F$ increases. For this particular example, we can
compute the AUC, which is 0.7948.
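The counting loops can also be written in vectorized form. The sketch below is an illustration, not the code used for Figure 9.30: it uses synthetic scores drawn from two Beta distributions as a stand-in for the classifier output (an assumption made so that the block runs without the data file), applies the rule "declare 1 when the score is at least $\tau$", and integrates with the trapezoidal rule. The function name `empirical_roc` is hypothetical.

```python
import numpy as np

def empirical_roc(scores, labels, thresholds):
    """Return (PF, PD), one pair per threshold, for the rule 'declare 1 when score >= tau'."""
    scores = np.asarray(scores)
    positives = np.asarray(labels).astype(bool)
    # predictions[i, j] is True when sample j is declared positive at thresholds[i]
    predictions = scores[None, :] >= np.asarray(thresholds)[:, None]
    PD = (predictions & positives).sum(axis=1) / positives.sum()       # true positive rate
    PF = (predictions & ~positives).sum(axis=1) / (~positives).sum()   # false positive rate
    return PF, PD

# Synthetic stand-in data (assumed): positives tend to receive larger scores.
rng = np.random.default_rng(0)
labels = np.append(np.ones(50), np.zeros(50))
scores = np.append(rng.beta(4, 2, 50), rng.beta(2, 4, 50))

tau = np.linspace(0, 1, 1000)
PF, PD = empirical_roc(scores, labels, tau)

# As tau grows both rates shrink, so reverse the arrays to integrate from (0,0) to (1,1).
PF_inc, PD_inc = PF[::-1], PD[::-1]
auc = np.sum(0.5 * (PD_inc[1:] + PD_inc[:-1]) * np.diff(PF_inc))
print(f"empirical AUC = {auc:.4f}")
```

With only 50 samples per class, the resulting curve is a staircase: each step corresponds to one sample crossing the threshold.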
Figure 9.30: The empirical ROC curve for the dataset fisheriris, using a classifier based on the
logistic regression.


Note that the empirical ROC is rough. It does not have the smooth concave shape of 
the theoretical ROC curve. One can prove that if the decision rule is Neyman-Pearson, i.e., 
if we conduct a likelihood ratio test, then the resulting ROC curve is concave. Otherwise, 
you can still obtain an empirical ROC curve for real datasets and classifiers. However, the 
shape is not necessarily concave. 


9.5.4 The Precision-Recall (PR) curve 


In modern data science, an alternative performance metric to the ROC curve is the precision- 
recall (PR) curve. The precision and recall are defined as follows. 


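Definition 9.4. Let TP = true positive, FP = false positive, FN = false negative.
The precision is defined as

$$\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{p_D}{p_D + p_F}, \tag{9.40}$$

and the recall is defined as

$$\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{p_D}{p_D + p_M}.$$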
In this definition, TP, FP, and FN are the numbers of samples that are classified as true
positive, false positive, and false negative, respectively. Since both precision and recall are
defined as ratios of numbers, the ratios can be equivalently defined through the rates. Using
our terminology, this gives us the definitions in terms of $p_D$, $p_F$ and $p_M$. Since $p_D = 1 - p_M$,
it also holds that the recall is $p_D$.

Let us take a moment to consider the meanings of precision and recall. Precision is
defined as

$$\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{\#\text{ true positives}}{\text{total } \#\text{ positives you claim}}. \tag{9.42}$$


The numerator of the precision is the number of true positive samples and the denominator 
is the total number of positives that you claim. This includes the true positives and the 
false positives. Therefore, precision measures how trustworthy your claim is. There are two 
scenarios to consider: 


• High precision: This means that among all the positives you claim, many of them are
the true positives. Therefore, whatever you claim is trustworthy. One possibility for
obtaining a high precision is that the critical level $\alpha$ of the Neyman-Pearson decision
rule approaches 0. In other words, you are very accepting of the null hypotheses. Thus,
whenever you reject, it will be a reliable reject.

• Low precision: This means that you are overclaiming the positives, and so there are
many false positives. Thus, even though you claim many positives, not all are trustworthy.
One reason why low precision occurs is that you are too eager to reject the
null. Thus you tend to overkill the unnecessary cases.


A similar analysis can be applied to the recall. The recall is defined as

$$\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{\#\text{ true positives}}{\text{total } \#\text{ positives in the distribution}}. \tag{9.43}$$


The difference between the recall and the precision is the denominator. For recall, the
denominator is the total number of positives in the distribution. We are not interested
in knowing what you have claimed but in knowing how many of them are there in the
distribution. If you examine the definition using $p_D$, you can see that recall is the probability
of detection, that is, how successfully you can detect a target. A high recall and a low recall can
occur in two situations:

• High recall: This means that you are very good at detecting the target or rejecting
the null appropriately. A high recall can happen when the critical level $\alpha$ is high so that
you rarely miss a target. However, if the critical level $\alpha$ is high, you will suffer from a
low precision.

• Low recall: This means that you are too accepting of the null hypotheses, and so you
rarely claim that there is a target. As a result the number of successful detections is
low. However, having a low recall can buy you high precision because you do not reject
the null unless it has extreme evidence (hence there is hardly any false alarm).


As you can see from the discussions above, the precision-recall has a trade-off, just as
the ROC curve does. Since the PR curve and ROC curve are derived from $p_F$ and $p_D$, there
is a one-to-one correspondence. This can be proved by rearranging the terms in the definitions
of precision and recall.

Theorem 9.3. The false alarm rate $p_F$ and the detection rate $p_D$ can be expressed
in terms of the precision and recall as

$$p_F = \frac{\text{recall}\,(1 - \text{precision})}{\text{precision}}, \qquad p_D = \text{recall}.$$


This result implies that whenever we have an ROC curve we can convert it to a PR curve. 
Moreover, whenever we have a PR curve we can convert it to an ROC curve. Therefore, 
there is no additional information one can squeeze out by converting the curves. What we 
can claim, at most, is that the two curves offer different ways of interpreting the decision 
rule. 

To illustrate the equivalence between an ROC curve and a PR curve, we plot two 
different decision rules in Figure 9.31. Any point on the ROC curve will have a corresponding 
point on the PR curve, and vice versa. 
Figure 9.31: There is a one-to-one correspondence between the ROC curve and the PR curve.


The MATLAB and Python codes for generating the PR curve are straightforward. 
Assuming that we have run the code used to generate Figure 9.28, we plot the PR curve as 
follows (this will give us Figure 9.31). 


% MATLAB code to generate a PR curve
precision1 = PD1./(PD1+PF1);
precision2 = PD2./(PD2+PF2);
recall1 = PD1;
recall2 = PD2;
plot(recall1, precision1, 'LineWidth', 4); hold on;
plot(recall2, precision2, 'LineWidth', 4); hold off;
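A Python counterpart can be sketched as follows. To keep the block self-contained it recomputes the two ROC curves of Figure 9.28 instead of reusing earlier variables, and it starts $\alpha$ at 0.001 rather than 0 to avoid the 0/0 that the precision formula produces at the origin; both choices are ours. The helper names `roc_to_pr` and `pr_to_roc` are hypothetical; they implement Definition 9.4 and Theorem 9.3, respectively.

```python
import numpy as np
from scipy import stats

sigma, mu = 2.0, 3.0

# Proposed two-sided rule (pF = 2*alpha, so alpha stays within (0, 0.5])
alpha1 = np.linspace(0.001, 0.5, 500)
z1 = stats.norm.ppf(1 - alpha1)
PF1 = 2 * alpha1
PD1 = 1 - (stats.norm.cdf(z1 - mu / sigma) - stats.norm.cdf(-z1 - mu / sigma))

# Neyman-Pearson rule
alpha2 = np.linspace(0.001, 1, 1000)
PF2 = alpha2
PD2 = 1 - stats.norm.cdf(stats.norm.ppf(1 - alpha2) - mu / sigma)

def roc_to_pr(PF, PD):
    """Precision = pD / (pD + pF) and recall = pD, as in Definition 9.4."""
    return PD / (PD + PF), PD

def pr_to_roc(precision, recall):
    """pF = recall * (1 - precision) / precision and pD = recall, as in Theorem 9.3."""
    return recall * (1 - precision) / precision, recall

precision1, recall1 = roc_to_pr(PF1, PD1)
precision2, recall2 = roc_to_pr(PF2, PD2)

# Round trip: converting the PR curve back should recover the ROC curve.
PF2_back, PD2_back = pr_to_roc(precision2, recall2)
print(np.allclose(PF2_back, PF2), np.allclose(PD2_back, PD2))
```

Plotting `recall1` against `precision1` and `recall2` against `precision2` with `matplotlib` then gives the PR curves; the round trip illustrates that no information is gained or lost by switching between the two representations.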
Practice Exercise 9.8. Suppose that the decision rule is a blind guess:

$$\delta(y) = \begin{cases} 1, & \text{with probability } \alpha, \\ 0, & \text{with probability } 1-\alpha. \end{cases}$$

Plot the ROC curve and the PR curve.

Solution: As we have shown earlier, $p_F(\alpha)$ and $p_D(\alpha)$ for this decision rule are $p_F(\alpha) =
\alpha$ and $p_D(\alpha) = \alpha$. Therefore,

$$\text{precision} = \frac{p_D}{p_D + p_F} = \frac{\alpha}{\alpha + \alpha} = \frac{1}{2}, \quad\text{and}\quad \text{recall} = p_D = \alpha.$$

Thus the PR curve is a straight line with a level of 0.5.
Figure 9.32: The PR curve of a blind-guess decision rule is a straight line.

Practice Exercise 9.9. Convert the ROC curve in Figure 9.30 to a PR curve.

Solution: The conversion is done by first computing $p_F$ and $p_D$. Defining the precision
and recall in terms of $p_F$ and $p_D$, we plot the PR curves below.

Figure 9.33: The PR curve of a real dataset.

As you can see from the figure, the PR curve behaves very differently from the
ROC curve. It is sometimes argued that the two curves can be interpreted differently,
even though they describe the same decision rule for the same dataset.


9.6 Summary 


In this chapter, we have discussed five principles for quantifying the confidence of an esti- 
mator and making statistical decisions. To summarize the chapter, we clarify a few common 
misconceptions about these topics. 


• Confidence interval. Students frequently become confused about the meaning of a
confidence interval. It is not the interval that 95% of the samples will fall inside. It 
is also not the interval within which the estimator has a 95% chance to show up. 
A confidence interval is a random interval that has a 95% chance of including the 
population parameter. A better way to think about a confidence interval is to think of 
it as an alternative to a point estimate. A point estimate only gives a point, whereas 
a confidence interval extends the point to an interval. All the randomness of the point 
estimate is also there in the confidence interval. However, if the confidence interval is 
narrow, there is a good chance for the point estimate to be accurate. 


• Bootstrapping. The most common misconception about bootstrapping is that it can
create something from nothing. Another misconception is that bootstrapping can make 
your estimates better. Both beliefs are wrong. Bootstrapping is a technique for esti- 
mating the estimator’s variance, and consequently it provides a confidence interval. 
Bootstrapping does not improve the point estimate, no matter how many bootstrapping
samples you synthesize. Bootstrapping works because the sampling-with-replacement
step is equivalent to drawing samples from the empirical distribution. The
whole process relies on the proximity between the empirical distribution and the true 
population. If you do not have enough samples and the empirical distribution does not 
approximate the population, bootstrapping will not work. Therefore, bootstrapping 
does not create something from nothing; it uses whatever you have and tells you how 
reliable the estimate is. 


• Hypothesis testing. Students are often overwhelmed at first by the great number of
tests one can use for hypothesis testing, e.g., p-value, critical value, Z-test, T-test, $\chi^2$
test, F-test, etc. Our advice is to forget about them and remember that hypothesis
testing is a court trial. Your job is to decide whether you have enough evidence to 
declare that the defendant is guilty. To reach a guilty verdict, you need to make sure 
that the test statistic is unlikely to occur under the null hypothesis. Therefore, the best practice is to draw
the distributions of the test statistic and ask yourself how likely it is that the test
statistic has such a value. When you draw the pictures of the distributions, you will
know whether you should use a Gaussian Z, a Student’s t, a χ², an F-statistic, etc.
When you examine the likelihood of the test statistic, you will know whether you want 
to use the p-value or the critical value. If you follow this principle, you will never be 
confused by the oceans of tests you find in the textbooks. 




CHAPTER 9. CONFIDENCE AND HYPOTHESIS 


• Neyman-Pearson. Beginners often find Neyman-Pearson abstract and do not understand
why it is useful. In this chapter, however, we have explained why we need to
understand Neyman-Pearson. It is a very general framework for many kinds of hy- 
pothesis testing problems. All it says is that if we want to maximize the detection rate 
while maintaining the false alarm rate, then the optimal testing procedure boils down 
to the critical-value test and the p-value test. This gives us a certificate that our usual 
hypothesis testing is optimal according to the Neyman-Pearson framework. 


• ROC and PR curves. On the internet nowadays there is a huge quantity of articles,
blogs, and tutorials about how to plot the ROC curve and the PR curve. Often these
curves are explained through programming examples in languages such as Python, R, or MATLAB.
Our advice for studying the ROC curve and the PR curve is to go back to the Neyman- 
Pearson framework. These two curves do not come out of the blue. The ROC curve is 
the natural figure explaining the objective and the constraint in the Neyman-Pearson 
framework. By changing the coordinates, we obtain the PR curve. Therefore, the two 
curves are the same in terms of the amount of information, but they offer different 
interpretations. 
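
The statement that a confidence interval is a random interval covering the population parameter about 95% of the time can be checked by simulation. The following is a minimal sketch, assuming Gaussian samples with a known standard deviation; the function name `ci_coverage` and all parameter values are illustrative choices, not taken from the chapter.

```python
import numpy as np

def ci_coverage(theta=1.0, sigma=2.0, N=30, trials=20000, z=1.96, seed=0):
    """Fraction of repeated experiments whose 95% interval covers theta."""
    rng = np.random.default_rng(seed)
    # Each row is one experiment with N i.i.d. Gaussian samples.
    X = rng.normal(theta, sigma, size=(trials, N))
    theta_hat = X.mean(axis=1)                 # point estimate per experiment
    half_width = z * sigma / np.sqrt(N)        # known-variance interval
    covered = (theta_hat - half_width <= theta) & (theta <= theta_hat + half_width)
    return covered.mean()

coverage = ci_coverage()
print(coverage)   # should be close to 0.95
```

The population mean `theta` never moves in this experiment; only the interval changes from one trial to the next, which is the point of the first bullet above.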
9.8 Problems


Exercise 1. 

Consider i.i.d. Gaussian random variables X_1, ..., X_N with an unknown mean θ and a known
variance σ² = 1. Suppose N = 30. Find the confidence level 1 − α for the confidence intervals
of the mean θ:


(a) T= [9 22,0 2.140 


VN 
(b) Z = [6 — 183,64 1880] 
Exercise 2.
Suppose that we have conducted an experiment with N = 100 samples. A 95% confidence
interval of the mean was 0.45 < μ < 0.82.


(a) Would a 99% confidence interval calculated from the sample data be wider or narrower? 


(b) Is it correct to interpret the confidence interval as saying that there is a 95% chance
that μ is between 0.49 and 0.82? You may answer yes, no, or partially correct. Explain.


(c) Is it correct to say that if we conduct the experiment 1000 times, there will be 950
confidence intervals that will contain μ? You may answer yes, no, or partially correct.
Explain. 


Exercise 3. 
Suppose that we have conducted an experiment. We know that σ = 25. We obtained N = 20
samples and found that the sample mean is Θ̂ = 1014.


(a) Construct a 95% two-sided confidence interval of θ.


(b) Construct a 95% one-sided confidence interval (the lower tail) of θ.


Exercise 4. 
Let X_1, ..., X_N be i.i.d. Gaussian with X_n ~ Gaussian(0, 1). Let Y_n = e^{X_n}, and suppose
we have N = 100 samples. We want to compute a 95% confidence interval for skewness.


(a) Randomly subsample the dataset with B = 30 samples. Repeat the exercise 5 times. 
Plot the resulting histograms using MATLAB or Python. 


(b) Repeat (a) 500 times and compute the 95% bootstrapped confidence interval
of the skewness. 


(c) Try using a larger B = 70 and a smaller B = 10. Report the 95% bootstrapped 
confidence interval of the skewness. 


Exercise 5.
Let X_1, ..., X_N be i.i.d. uniform with X_n ~ Uniform(0, θ). Let Θ̂ = max{X_1, ..., X_N}.
Generate a dataset of N = 50 with θ = 1.


(a) Find the distribution of the estimator Θ̂.
(b) Show that P[O = 6] = 1 —(1—(1/n))%. Thus, as N + co, we have P[O = 6] = 0. 


(c) Use Python or MATLAB to generate the histogram of Θ̂ from bootstrapping. How
does the bootstrapped histogram look as N grows? Why? 


Exercise 6. 
Let X be a Gaussian random variable with unknown mean and unknown variance. It was 
found that with N = 15, 


$$\sum_{n=1}^{N} X_n = 250, \qquad \sum_{n=1}^{N} X_n^2 = 10000.$$

Find a 95% confidence interval of the mean of X.


Exercise 7. 
Let Θ̂ be the sample mean of a dataset containing N samples. It is known that the samples
are drawn from Gaussian(θ, 3²). Find N such that


$$\mathbb{P}[\hat{\Theta} - 1 < \theta < \hat{\Theta} + 1] = 0.95.$$


Exercise 8. 
Which of the following statements are valid hypothesis testing problems? 


Hop: 0 > 10 and Hy: o = 10. 
(c) H_0: X = 50 and H_1: X ≠ 50.


Exercise 9. 

It is claimed that the mean is θ = 12 with a standard deviation 0.5. Consider H_0: θ = 12
and H_1: θ < 12. Ten samples are obtained, and it is found that Θ̂ = 13.5. With a 95%
confidence level, should we accept or reject the null hypothesis?


Exercise 10. 
Consider a hypothesis testing problem: H_0: θ = 175 versus an alternative hypothesis H_1:
θ > 175. Assume N = 10 and σ = 20.


(a) Find the type 1 error if the critical region is Θ̂ > 185.


(b) Find the type 2 error if the true mean is 195. 


Exercise 11. 
Consider H_0: θ = 30000 versus an alternative hypothesis H_1: θ > 30000. Suppose N = 16,
and let σ = 1500.


(a) If we want α = 0.01, what is z_α?


(b) What is the type 2 error when θ = 31000?


Exercise 12. 
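Let W_n ~ Gaussian(0, σ²), and consider two hypotheses:


$$H_0: X_n = \theta_0 + W_n, \qquad n = 1, \ldots, N,$$
$$H_1: X_n = \theta_1 + W_n, \qquad n = 1, \ldots, N.$$


Let $\overline{X} = (1/N) \sum_{n=1}^{N} X_n$.

(a) Show that the likelihood of observing X_1, ..., X_N given H_0 is


$$f_X(x \,|\, H_0) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \theta_0)^2 \right\}.$$


(b) Find the likelihood f_X(x | H_1) of observing X_1, ..., X_N given H_1.


(c) The likelihood ratio test states that


$$\frac{f_X(x \,|\, H_1)}{f_X(x \,|\, H_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau.$$


Show that the likelihood ratio test is given by


$$\overline{X} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \frac{\theta_0 + \theta_1}{2} + \frac{\sigma^2 \log \tau}{N(\theta_1 - \theta_0)}.$$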
Random Processes


In modern data science, many problems involve time. The stock market changes every 
minute; a speech signal changes every millisecond; a car changes its steering angle constantly; 
the examples are endless. A common theme among all these examples is randomness. We 
do not know whether a stock will go up or down tomorrow, although we may be able to 
make some predictions based on previous observations. We do not know the next word of a 
sentence, but we can guess based on the context. Random processes are tools that can be 
applied to these situations. We treat a random process as an infinitely long vector of random 
variables where the correlations between the individual variables define the statistical prop- 
erties of the process. If we can determine these correlations, we will be able to summarize 
the past and predict the future. 

The objective of this chapter is to introduce the basic concepts of random processes. 
Given the breadth of the subject, we can only cover the most elementary results, but they 
are sufficient for many engineering and data science problems. However, there are complex 
situations for which these elementary results will be insufficient. The references at the end 
of this chapter contain more in-depth discussions of random processes. 


Plan of this chapter 


We begin by outlining the definition of random processes and ways to characterize their 
randomness in Section 10.1. In Section 10.2 we discuss the mean function, the autocorrelation 
function, and the autocovariance function of a random process. In Section 10.3 we look at 
a special subclass of random processes known as the wide-sense stationary processes. Wide- 
sense stationary processes allow us to use tools in the Fourier domain to make statistical 
statements. Based on wide-sense stationary processes, we discuss power spectral density in 
Section 10.4. With this concept, we can ask what will happen to the random process when we 
pass it through a linear transformation. In Section 10.5 we discuss such interactions between 
the random process and a linear time-invariant system. Finally, we discuss a practical usage 
of random processes in the subject of optimal linear filters in Section 10.6. 
10.1 Basic Concepts


10.1.1 Everything you need to know about a random process 


Here is the single most important thing you need to remember about random processes: 


What is a random process? 


A random process is a function indexed by a random key. 


That’s it. Now you may be wondering what exactly a “function indexed by a random key” 
means. To help you see the picture, we consider two examples. 


Example 10.1. We consider a set of straight lines. We define two random variables a 
and b that are uniformly distributed in a certain range. We then define a function: 


$$f(t) = at + b, \qquad -2 < t < 2. \tag{10.1}$$


Clearly, f(t) is a function of time t. But since a and b are random, f(t) is also random.
The randomness is caused by a and b. To emphasize this dependency, we write f(t) as


$$f(t, \xi) = a(\xi)\,t + b(\xi), \qquad -2 < t < 2,$$


where ξ ∈ Ω denotes the random index of the constants (a, b) and Ω is the sample
space of ξ. Therefore, by picking a different pair of constants (a(ξ), b(ξ)), we will have
a different function f(t, ξ), which in our case is a straight line of different slope and
y-intercept.


Figure 10.1: The set of straight lines f(t) = at + b where a, b ∈ ℝ.


As a special case of the example, suppose that the sample space contains only
two pairs of constants: (a, b) = (1.2, 0.6) and (a, b) = (−0.75, 1.8). The probability of
getting either pair is 1/2. Then the function f(t, ξ) will take two forms:


$$f(t, \xi) = \begin{cases} 1.2t + 0.6, & \text{with probability } \tfrac{1}{2}, \\ -0.75t + 1.8, & \text{with probability } \tfrac{1}{2}. \end{cases}$$


Every time you pick a sample you pick one of the two functions, either f(t, ξ_1) or
f(t, ξ_2). So we say that f(t, ξ) is a random process because it is a function f(t) indexed
by a random key ξ.
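
The two-line special case can be mimicked in a few lines of Python. This is only a sketch: the helper name `draw_realization` and the use of NumPy's random generator are illustrative choices, not part of the example. Drawing the key selects an entire function of t, not a single number.

```python
import numpy as np

# The two possible (a, b) pairs, each picked with probability 1/2.
PAIRS = [(1.2, 0.6), (-0.75, 1.8)]

def draw_realization(rng):
    """Draw a random key xi and return (xi, f), where f(t) = a(xi)*t + b(xi)."""
    xi = int(rng.integers(len(PAIRS)))     # xi is 0 or 1, equally likely
    a, b = PAIRS[xi]
    return xi, (lambda t: a * t + b)

rng = np.random.default_rng(0)
t = np.linspace(-2, 2, 5)
for _ in range(3):
    xi, f = draw_realization(rng)
    print(xi, f(t))   # each draw is a whole line evaluated on the grid t
```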


Example 10.2. This example studies the function

$$f(t) = \cos(\omega_0 t + \Theta), \qquad -1 < t < 1,$$


where Θ is a random phase distributed uniformly over the range [0, 2π]. Depending on
the randomness of Θ, the function f(t) will take a different phase offset. To emphasize
this dependency, we write


$$f(t, \xi) = \cos(\omega_0 t + \Theta(\xi)), \qquad -1 < t < 1. \tag{10.2}$$


Figure 10.2: The set of phase-shifted cosines f(t) = cos(ω_0 t + θ) where θ ∈ [0, 2π].


Again, ξ denotes the index of the random variable Θ. Since Θ is drawn uniformly
from the interval [0, 2π], the following functions are two possible realizations:


3 
f(t, &1) = cos (wot + _ » =lere1, 


Hots) =o 


Just as with the previous example, f(t) is a function indexed by a random key ξ.


These two examples should give you a feeling for what to expect from a random process. 
A random process is quite similar to a random variable because they are both contained 
in a certain sample space. For (discrete) random variables, the sample space is a collection
of outcomes {ξ_1, ξ_2, ..., ξ_N}. The random variable X : Ω → ℝ is a mapping that maps
ξ_n to X(ξ_n), where X(ξ_n) is a number. For random processes, the sample space is also
{ξ_1, ξ_2, ..., ξ_N}. However, the mapping X does not map ξ_n to a number X(ξ_n) but to a
function X(t, ξ_n). A function has the time index t, which is absent in the number. Therefore,
for the same ξ_n, X(t_1, ξ_n) can take one value and X(t_2, ξ_n) can take another value.


Figure 10.3: The sample space of a random process X(t, ξ) contains many functions. Therefore, each
random realization is a function.


Figure 10.3 shows the sample space of a random process. Each outcome in the sample 
space is a function. The probability of getting a function is specified by the probability 
mass or the probability density of the associated random key ξ. If you put your hand into
the sample space, the sample you pick will be a function that will change with time and is 
indexed by the random key. From our discussions of joint random variables in Chapter 5, 
you can think of the function as a vector. When you pull a sample from the sample space, 
you pull the entire vector and not just an element. 


10.1.2 Statistical and temporal perspectives 


Since a random process is a function indexed by a random key, it is a two-dimensional object. 
It is a function both of time t and of the random key ξ. That’s why we use the notation
X(t, ξ) to denote a random process. These two axes play different roles, as illustrated in
Figure 10.4. 

Temporal perspective: Let us fix the random key at ξ = ξ_0. This gives us a function
X(t, ξ_0). Since ξ is already fixed at ξ_0, we are looking at a particular realization drawn
from the sample space. This realization is expressed as a function X(t, ξ_0), which is just
a deterministic function that evolves over time. There is no randomness associated with
it. This is analogous to a random variable. While X itself is a random variable, by fixing
the random key ξ = ξ_0, X(ξ_0) is just a real number. For random processes, X(t, ξ_0) now
becomes a function.

Since X(t, ξ_0) is a function that evolves over time, we view it along the horizontal axis.
For example, we can study the sequence


$$X(t_1, \xi_0),\; X(t_2, \xi_0),\; \ldots,\; X(t_K, \xi_0),$$


where t_1, ..., t_K are the time indices of the function. This sequence is deterministic and is
just a sequence of numbers, although the numbers evolve as t changes.


Statistical perspective: The other perspective, which could be slightly more abstract,
is the statistical perspective. Let us fix the time at t = t_0. The random key ξ can take any
state defined in the sample space. So if the sample space contains {ξ_1, ..., ξ_N}, the sequence
{X(t_0, ξ_1), ..., X(t_0, ξ_N)} is a sequence of random variables, because the ξ’s can go from
one state to another state.


Figure 10.4: Temporal and statistical perspectives of a random process, with panels (a) temporal
perspective and (b) statistical perspective. For the temporal perspective
(which we call the horizontal perspective), we fix the random key ξ and look at the function in time.
For the statistical perspective (which we call the vertical perspective), we fix the time and look at the
function at different random keys.

A good way to visualize the statistical perspective is the vertical perspective in which 
we write the sequence as a vertical column of random variables: 


$$\begin{bmatrix} X(t_0, \xi_1) \\ X(t_0, \xi_2) \\ \vdots \\ X(t_0, \xi_N) \end{bmatrix}$$


That is, if you fix the time at t = t_0, you are getting a sequence of random variables. The
probability of getting a particular value X(t_0, ξ) depends on which random state you land on.


Why do we bother to differentiate the temporal perspective and the statistical per- 
spective? The reason is that the operations associated with the two are different, even if 
sometimes they give you the same result. For example, if we take the temporal average of 
the random process, we get a number: 


$$\overline{X}(\xi) = \frac{1}{T} \int_{0}^{T} X(t, \xi)\, dt. \tag{10.3}$$


We call this the “temporal average” because we have integrated the function over time. The 
resulting value will not change with time. However, X̄(ξ) depends on the random key you
provide. If you pick a different random realization, X̄(ξ) will take a different value. So the
temporal average is a random variable. 

On the other hand, if we take the statistical average of the random process, we get 


$$\mathbb{E}[X(t)] = \int_{\Omega} X(t, \xi)\, p(\xi)\, d\xi, \tag{10.4}$$

where p(ξ) is the PDF of the random key ξ. We call this the statistical average because we
have taken the expectation over all possible random keys. The resulting object E[X(t)] is
deterministic but a function of time.

No matter how you look at the temporal average or the statistical average, they are
different, with the following exception: X̄(ξ) = const and E[X(t)] = const, for example,
X̄(ξ) = E[X(t)] = 0. This happens only for some special (and useful) random processes
known as ergodic processes that allow us to approximate the statistical average using the 
temporal average, with some guarantees derived from the law of large numbers. We will 
return to this point later. 


Example 10.3. Let A ~ Uniform[0, 1]. Define


$$X(t, \xi) = A(\xi) \cos(2\pi t).$$


In this example, the magnitude A(ξ) is a random variable depending on the
random key ξ. For example, if we draw ξ_1, perhaps we will get a value A(ξ_1) = 0.5.
Then X(t, ξ_1) = 0.5 cos(2πt). To take another example, if we draw ξ_2, we may get
A(ξ_2) = 1. Then X(t, ξ_2) = 1 cos(2πt). Figure 10.5 shows a few random realizations
of the cosines. We can look at X(t, ξ) from the statistical and the temporal views.


Figure 10.5: Five different realizations of the random process X(t) = A cos(2πt).


• Statistical View: Fix t (for example t = 10). In this case, we have

$$X(10, \xi) = A(\xi) \cos(20\pi),$$

which is a random variable because cos(20π) is a constant. The randomness of
X comes from the fact that A(ξ) ~ Uniform[0, 1].


• Temporal View: Fix ξ (for example A(ξ) = 0.7). In this case, we have

$$X(t, \xi) = 0.7 \cos(2\pi t),$$

which is a deterministic function of t.
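
The two views correspond to slicing a table of realizations along different axes. The sketch below is illustrative only: the array layout (one row per random key, one column per time instant), the grid, and the choice t = 1 for the statistical view are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=5)          # five random keys give five amplitudes
t = np.linspace(-2, 2, 401)
X = A[:, None] * np.cos(2 * np.pi * t)     # X[k, j] = A_k * cos(2*pi*t_j)

# Temporal view: fix the key (row 0). The result is one deterministic waveform.
one_realization = X[0, :]

# Statistical view: fix the time (t = 1, where cos(2*pi*t) = 1). The values
# across keys are samples of a single random variable.
j = np.argmin(np.abs(t - 1.0))
samples_at_t = X[:, j]

print(one_realization.shape)
print(samples_at_t)
```

At t = 1 the cosine equals 1, so the column of samples is simply the five amplitudes, each drawn from Uniform[0, 1].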
Example 10.4. Let A be a discrete random variable with a PMF


$$\mathbb{P}(A = +1) = \frac{1}{2} \qquad \text{and} \qquad \mathbb{P}(A = -1) = \frac{1}{2}.$$


We define the function X[n, ξ] = A(ξ)(−1)^n. In this example, A can only take two
states. If A = +1, then X[n, ξ] = (−1)^n. If A = −1, then X[n, ξ] = (−1)^{n+1}.


Figure 10.6: Realizations of the random process X[n] = A(−1)^n.


The graphical illustration of this example is shown in Figure 10.6. Again, we can
look at X[n, ξ] from two views.


• Statistical View: Fix n, say n = 10. Then,

$$X[10, \xi] = \begin{cases} (-1)^{10} = +1, & \text{with prob } 1/2, \\ (-1)^{11} = -1, & \text{with prob } 1/2, \end{cases}$$

which is a Bernoulli random variable.


• Temporal View: Fix ξ. Then,

$$X[n, \xi] = \begin{cases} (-1)^{n}, & \text{if } A = +1, \\ (-1)^{n+1}, & \text{if } A = -1, \end{cases}$$

which is a time series.


In this example, we see that the sample space of X[n, ξ] consists of only two functions
with probabilities


$$\mathbb{P}(X[n] = (-1)^{n}) = \frac{1}{2}, \qquad \mathbb{P}(X[n] = (-1)^{n+1}) = \frac{1}{2}.$$


Therefore, if there is a sequence outside the sample space, e.g.,

$$\mathbb{P}(X[n] = [\,1 \;\; 1 \;\; 1 \;\; {-1} \;\; 1 \;\; {-1} \;\; \cdots\,]) = 0,$$

then the probability of obtaining that sequence is 0.
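
A short simulation makes the two-function sample space visible. This is a sketch under assumed settings: the number of trials, the time window n = 0, ..., 5, and the fixed time n = 4 used for the statistical view are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = np.arange(6)                           # time indices 0, 1, ..., 5
trials = 10000
A = rng.choice([-1, 1], size=trials)       # A = +1 or -1, each with prob 1/2
X = A[:, None] * (-1) ** n                 # X[k, :] is one realization

# Temporal view: every realization is one of only two sequences.
distinct = np.unique(X, axis=0)
print(distinct)

# Statistical view: at a fixed n, the value is +1 or -1 with prob 1/2.
frac_plus = np.mean(X[:, 4] == 1)
print(frac_plus)
```

No matter how many keys are drawn, `distinct` contains exactly two rows, so any other sequence has probability zero.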
What do we mean by statistical average and temporal average?


• Statistical average: Take the expectation of X(t, ξ) over ξ. This is the vertical
average.


• Temporal average: Take the average of X(t, ξ) over t. This is the horizontal
average.


• In general, statistical average ≠ temporal average.


10.2 Mean and Correlation Functions 


For a random variable, we often want to know its expectation and variance. We want the
same quantities for a random process, except that we now need to account for the time
axis. In this section, we discuss the mean function and the
autocorrelation function.


10.2.1 Mean function 


Definition 10.1. The mean function μ_X(t) of a random process X(t) is

$$\mu_X(t) = \mathbb{E}[X(t)]. \tag{10.5}$$


Let’s consider the “expectation” of X(t). Recall that a random process is actually X(t, ξ)
where ξ is the random key. Therefore, the expectation is taken with respect to ξ, or to state
it more explicitly,


$$\mu_X(t) = \mathbb{E}[X(t)] = \int_{\Omega} X(t, \xi)\, p(\xi)\, d\xi,$$


where p(ξ) is the PDF of the random key. This is an abstract definition, but it is not difficult
to understand if you follow the example below.


Example 10.5. Let A ~ Uniform[0, 1], and let X(t) = A cos(2πt). Find μ_X(t).


Solution. The solution to this problem is actually very simple:


$$\mu_X(t) = \mathbb{E}[A \cos(2\pi t)] = \cos(2\pi t)\, \mathbb{E}[A] = \frac{1}{2} \cos(2\pi t).$$


So the answer is μ_X(t) = (1/2) cos(2πt).


We can link the equations to the definition more explicitly. To do so, we rewrite
X(t) as

$$X(t, \xi) = A(\xi) \cos(2\pi t).$$

Then we take the expectation over A:

$$\mu_X(t) = \int_{\Omega} X(t, \xi)\, p(\xi)\, d\xi = \int_{0}^{1} a \cos(2\pi t) \cdot 1\, da = \cos(2\pi t) \cdot \left[\frac{a^2}{2}\right]_{0}^{1} = \frac{1}{2} \cos(2\pi t).$$


Figure 10.7: The mean function of X(t) = A cos(2πt).


An illustration is provided in Figure 10.7, in which we observe many random 
realizations of the random process X(t, ξ). On top of these, we also see the mean
function. The way to visualize the mean function is to use the statistical perspective. 
That is, fix a time t and look at all the possible values that the function can take. For 
example, if we fix t = to, then we will have a set of realizations of one random variable: 


$$\{0.1 \cos(2\pi t_0),\; 0.58 \cos(2\pi t_0),\; \ldots,\; 0.93 \cos(2\pi t_0)\} \;\longrightarrow\; \text{take expectation}$$


Therefore, when we take the expectation, it is that of the underlying random variable. 
If we move to another timestamp t = t_1, we will have a different expectation because
cos(2πt_0) now becomes cos(2πt_1).


The MATLAB/Python codes used to generate Figure 10.7 are shown below. You can
also replace the line 0.5*cos(2*pi*t) by the empirical mean function mean(X,2) (in MATLAB).


% MATLAB code for Example 10.5
X = zeros(1000,20);
t = linspace(-2,2,1000);
for i=1:20
    X(:,i) = rand(1)*cos(2*pi*t);
end
plot(t, X, 'LineWidth', 2, 'Color', [0.8 0.8 0.8]); hold on;
plot(t, 0.5*cos(2*pi*t), 'LineWidth', 4, 'Color', [0.6 0 0]);


# Python code for Example 10.5
import numpy as np
import matplotlib.pyplot as plt
x = np.zeros((1000,20))
t = np.linspace(-2,2,1000)
for i in range(20):
    x[:,i] = np.random.rand(1)*np.cos(2*np.pi*t)
plt.plot(t,x,color='gray')
plt.plot(t,0.5*np.cos(2*np.pi*t),color='red')
plt.show()


Example 10.6. Let Θ ~ Uniform[−π, π], and let X(t) = cos(ωt + Θ). Find μ_X(t).


Solution.


$$\mu_X(t) = \mathbb{E}[\cos(\omega t + \Theta)] = \int_{-\pi}^{\pi} \cos(\omega t + \theta) \cdot \frac{1}{2\pi}\, d\theta = 0.$$


Again, as in the previous example, we can try to map this simple calculation with the
definition. Write X(t) as

$$X(t, \xi) = \cos(\omega t + \Theta(\xi)).$$


Then the expectation is


$$\mu_X(t) = \int_{\Omega} X(t, \xi)\, p(\xi)\, d\xi = \int_{-\pi}^{\pi} \cos(\omega t + \theta) \cdot \frac{1}{2\pi}\, d\theta = 0.$$


Figure 10.8: The mean function of X(t) = cos(ωt + Θ).


Figure 10.8 illustrates the random realizations for X(t) = cos(ωt + Θ) and the
mean function. The zero mean should not be a surprise because if we take the statistical 
average (the vertical average) across all the possible values at any time instant, the 
positive and negative values of the realizations will make the mean zero. 

We should emphasize that the statistical average is not the same as the temporal
average, even if they give you the same value. Why do we say that? If we calculate
the temporal average of the function cos(ωt + θ_0) for a specific value Θ = θ_0, then we
have


$$\overline{X} = \frac{1}{T} \int_{0}^{T} \cos(\omega t + \theta_0)\, dt = 0,$$


assuming that T is a multiple of the cosine period. This implies that the temporal
average is zero, which is the same as the statistical average. This gives us an example
in which the statistical average and the temporal average have the same value, although
we know they are two completely different things.
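
The two averages can be computed side by side to see that they are different operations that happen to agree here. The following is a rough numerical sketch; the frequency ω = 2π, the window length T = 3, the fixed time t0 = 0.3, and the sample sizes are all assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
omega = 2 * np.pi            # assumed frequency, so one period has length 1
T = 3.0                      # a whole number of periods

# Temporal average: fix one phase theta0 and average over time.
theta0 = rng.uniform(-np.pi, np.pi)
t = np.linspace(0.0, T, 3000, endpoint=False)
temporal_avg = np.mean(np.cos(omega * t + theta0))

# Statistical average: fix one time t0 and average over many random phases.
t0 = 0.3
theta = rng.uniform(-np.pi, np.pi, size=200000)
statistical_avg = np.mean(np.cos(omega * t0 + theta))

print(temporal_avg, statistical_avg)   # both are close to zero
```

The first average runs along a single realization (one row of the picture), while the second runs across realizations at one time instant (one column).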


The MATLAB/Python codes used to generate Figure 10.8 are shown below. 


% MATLAB code for Example 10.6
X = zeros(1000,20);
t = linspace(-2,2,1000);
for i=1:20
    X(:,i) = cos(2*pi*t+2*pi*rand(1));
end
plot(t, X, 'LineWidth', 2, 'Color', [0.8 0.8 0.8]); hold on;
plot(t, 0*cos(2*pi*t), 'LineWidth', 4, 'Color', [0.6 0 0]);


# Python code for Example 10.6
import numpy as np
import matplotlib.pyplot as plt
x = np.zeros((1000,20))
t = np.linspace(-2,2,1000)
for i in range(20):
    Theta = 2*np.pi*(np.random.rand(1))
    x[:,i] = np.cos(2*np.pi*t+Theta)
plt.plot(t,x,color='gray')
plt.plot(t,np.zeros((1000,1)),color='red')
plt.show()


Example 10.7. Let us consider a discrete-time random process. Let X[n] = S^n, where
S ~ Uniform[0, 1]. Find μ_X[n].

$$\mu_X[n] = \mathbb{E}[S^n] = \int_{0}^{1} s^n\, ds = \frac{1}{n+1}.$$


In this example the randomness goes with the constant s. Thus, if we write X[n] as


$$X[n, \xi] = [S(\xi)]^n,$$


the expectation is


$$\mathbb{E}[X[n]] = \int s^n\, p_S(s)\, ds = \int_{0}^{1} s^n\, ds = \frac{1}{n+1}.$$


The graphical illustration is provided in Figure 10.9.
Figure 10.9: The mean function of X[n] = S^n, where S ~ Uniform[0, 1].


The MATLAB code used to generate Figure 10.9 is shown below. We skip the Python 
implementation because it is straightforward. 


% MATLAB code for Example 10.7
t = 0:20;
for i=1:20
    X(:,i) = rand(1).^t;
end
stem(t, X, 'LineWidth', 2, 'Color', [0.8 0.8 0.8]); hold on;
stem(t, 1./(t+1), 'LineWidth', 2, 'MarkerSize', 8);
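
For readers working in Python, one possible counterpart is sketched below. It is not a line-by-line translation of the MATLAB listing: the variable names, the extra Monte Carlo check of the mean function, and the helper `plot_example` (which draws connected markers rather than stems) are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = np.arange(21)                          # n = 0, 1, ..., 20
S = rng.uniform(0.0, 1.0, size=20)         # 20 random keys
X = S[None, :] ** n[:, None]               # X[n, i] = S_i ** n
mean_fn = 1.0 / (n + 1)                    # theoretical mean function 1/(n+1)

# Monte Carlo check of the mean function with many more realizations.
S_many = rng.uniform(0.0, 1.0, size=100000)
empirical = (S_many[None, :] ** n[:, None]).mean(axis=1)
print(np.max(np.abs(empirical - mean_fn)))

def plot_example():
    """Plot the 20 realizations (gray) and the mean function (red)."""
    import matplotlib.pyplot as plt
    plt.plot(n, X, 'o-', color='gray')
    plt.plot(n, mean_fn, 'o-', color='red')
    plt.show()
```

Calling `plot_example()` displays the figure; the numerical check runs without any plotting backend.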


10.2.2 Autocorrelation function


In random processes, the notions of “variance” and “covariance” are trickier than for random 
variables. Let us first define the concept of an autocorrelation function. 


Definition 10.2. The autocorrelation function of a random process X(t) is


$$R_X(t_1, t_2) = \mathbb{E}[X(t_1) X(t_2)]. \tag{10.6}$$


R_X(t_1, t_2) is not difficult to calculate: just integrate X(t_1)X(t_2) using the appropriate
PDFs.


Example 10.8. Let A ~ Uniform[0, 1], X(t) = A cos(2πt). Find R_X(t_1, t_2).


Solution.


$$R_X(t_1, t_2) = \mathbb{E}[A \cos(2\pi t_1)\, A \cos(2\pi t_2)] = \mathbb{E}[A^2] \cos(2\pi t_1) \cos(2\pi t_2) = \frac{1}{3} \cos(2\pi t_1) \cos(2\pi t_2).$$
Example 10.9. Let Θ ~ Uniform[−π, π], X(t) = cos(ωt + Θ). Find R_X(t_1, t_2).


Solution.


$$\begin{aligned}
R_X(t_1, t_2) &= \mathbb{E}[\cos(\omega t_1 + \Theta) \cos(\omega t_2 + \Theta)] \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} \cos(\omega t_1 + \theta) \cos(\omega t_2 + \theta)\, d\theta \\
&\overset{(a)}{=} \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{2} \left[ \cos(\omega(t_1 + t_2) + 2\theta) + \cos(\omega(t_1 - t_2)) \right] d\theta \\
&= \frac{1}{2} \cos(\omega(t_1 - t_2)),
\end{aligned}$$


where in (a) we applied the trigonometric formula:


$$\cos A \cos B = \frac{1}{2} [\cos(A + B) + \cos(A - B)].$$


As you can see, the calculations are not difficult. The tricky thing is the interpretation
of R_X(t_1, t_2).
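
The closed form for the random-phase cosine can be sanity-checked by Monte Carlo. This is a sketch under assumptions: the function name `autocorr_mc`, the frequency ω = 2π, the number of trials, and the test pairs are illustrative. The first two pairs share the same difference t_1 − t_2, so their estimates should agree.

```python
import numpy as np

def autocorr_mc(t1, t2, omega=2 * np.pi, trials=200000, seed=0):
    """Monte Carlo estimate of E[X(t1) X(t2)] for X(t) = cos(omega*t + Theta)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-np.pi, np.pi, size=trials)   # Theta ~ Uniform[-pi, pi]
    return np.mean(np.cos(omega * t1 + theta) * np.cos(omega * t2 + theta))

omega = 2 * np.pi
for t1, t2 in [(0.0, 0.1), (0.7, 0.8), (0.2, 0.5)]:
    estimate = autocorr_mc(t1, t2, omega)
    formula = 0.5 * np.cos(omega * (t1 - t2))          # closed form from the example
    print(t1, t2, round(estimate, 3), round(formula, 3))
```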


How do we understand the meaning of E[X(t_1)X(t_2)]?


E[X(t_1)X(t_2)] is analogous to the correlation E[XY] between two random variables
X and Y.


The autocorrelation function E[X(t_1)X(t_2)] is analogous to the correlation E[XY] in relation
to a pair of random variables. In our discussions of E[XY], we mentioned that E[XY]
could be regarded as the inner product of two vectors, and so it is a measure of the closeness
between X and Y. Now, if we substitute X and Y with X(t_1) and X(t_2) respectively, then
we are effectively asking about the closeness between X(t_1) and X(t_2). So, in a nutshell, the
autocorrelation function tells us the correlation between the function at two different
timestamps.

What do we mean by the correlation between two timestamps? Remember that X(t_1)
and X(t_2) are two random variables. Consider the following example.


Example 10.10. Let X(t) = A cos(2πt), where A ~ Uniform[0, 1]. Find E[X(0)X(0.5)].


Solution. If X(t) = A cos(2πt), then


$$X(0) = A \cos(0) = A, \qquad X(0.5) = A \cos(\pi) = -A.$$


When you have two random variables, you consider their correlations. Using this example,
we have that


$$\mathbb{E}[X(0)X(0.5)] = \mathbb{E}[A \cdot (-A)] = -\mathbb{E}[A^2] = -\frac{1}{3}.$$


A picture will reveal what is happening. Figure 10.10 presents the realizations of the
random process X(t) = A cos(2πt). If we consider X(0) and X(0.5), each of them is a
random variable, and thus we can ask about their PDFs. It is obvious from the illustration
that the random variable X(0) has a PDF that is a uniform distribution from 0 to 1,
whereas the random variable X(0.5) has a PDF that is a uniform distribution from −1 to 0.
Mathematically, the PDFs are


$$f_{X(0)}(x) = \begin{cases} 1, & 0 \le x \le 1, \\ 0, & \text{otherwise}, \end{cases} \qquad \text{and} \qquad f_{X(0.5)}(x) = \begin{cases} 1, & -1 \le x \le 0, \\ 0, & \text{otherwise}. \end{cases}$$


Since X(0) and X(0.5) have their own PDFs, we can calculate their correlation. This will
give us E[X(0)X(0.5)], which after some calculations is E[X(0)X(0.5)] = −1/3.
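
The claim that X(0) and X(0.5) are two random variables with their own ranges, and that their correlation is −1/3, can be checked with a few lines of simulation. The sample size and variable names below are assumptions for illustration. Note that the two variables share the same underlying A, which is why the product is always −A².

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=200000)   # one amplitude per realization

X0 = A * np.cos(2 * np.pi * 0.0)         # X(0)   =  A, spread over [0, 1]
X05 = A * np.cos(2 * np.pi * 0.5)        # X(0.5) = -A, spread over [-1, 0]

corr = np.mean(X0 * X05)                 # estimate of E[X(0) X(0.5)] = -E[A^2]
print(X0.min(), X0.max())
print(X05.min(), X05.max())
print(corr)                              # close to -1/3
```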




Figure 10.10: The autocorrelation between X(0) and X(0.5) should be regarded as the correlation 
between two random variables. Each random variable has its own PDF. 


We can now consider the autocorrelation for any t_1 and t_2. When you are evaluating
the autocorrelation function, you are not just evaluating at t = 0 and t = 0.5, you are
also evaluating the correlation for all pairs of t_1 and t_2. Now you want to know what the
correlation is between t = 0 and t = 0.5, t = 2 and t = 3.1, etc. Of course, there are
infinitely many pairs of time instants. The point of the autocorrelation function is to tell
you the correlation of all the pairs. In other words, if we tell you R_X(t_1, t_2), you will be able
to plug in a value of t_1 and a value of t_2 and tell us the correlation at (t_1, t_2). How is this
possible? To find out, let’s consider the following example.
Example 10.11. Let A ~ Uniform[0, 1], X(t) = A cos(2πt). Find R_X(0, 0.5), and draw
R_X(t_1, t_2).


Solution. From the previous example, we know that


$$R_X(t_1, t_2) = \frac{1}{3} \cos(2\pi t_1) \cos(2\pi t_2).$$


Therefore, R_X(0, 0.5) = (1/3) cos(2π · 0) cos(2π · 0.5) = −1/3, which is the same as if we had
computed it from first principles.

The autocorrelation function tells you how one point of a time series is correlated
with another point of the time series. If R_X(t_1, t_2) gives a high value, then it means the
random variables at t_1 and t_2 have a strong correlation. To understand this, suppose
we let t_1 = 0, and let us vary t_2. Then


$$R_X(0, t_2) = \frac{1}{3} \cos(2\pi \cdot 0) \cos(2\pi t_2) = \frac{1}{3} \cos(2\pi t_2).$$


This is a periodic function that repeats itself whenever t_2 increases by an integer. As
we recall from Figure 10.10, if t_2 = 0.5, the random variable X(t_2) will take only
the negative values, but otherwise it is correlated with X(0). On the other hand, if
t_2 = 0.25, then Figure 10.10 says that the random variable X(t_2) is a constant 0, and
so the correlation with X(0) is zero.

Clearly, R_X(t_1, t_2) is a 2-dimensional function of t_1 and t_2. You need to tell R_X
which of the two time instants you want to compare, and then R_X will tell you the
correlation. So no matter what happens, you must specify two time instants. Because
R_X(t_1, t_2) is a 2-dimensional function, we can visualize it by calculating all the possible
values it takes. For example, if R_X(t_1, t_2) = (1/3) cos(2πt_1) cos(2πt_2), we can plot R_X as
a function of t_1 and t_2. Figure 10.11 shows the plot.


Figure 10.11: The autocorrelation function R_X(t_1, t_2) = (1/3) cos(2πt_1) cos(2πt_2).




CHAPTER 10. RANDOM PROCESSES 


The MATLAB/Python code for Figure 10.11 is shown below.


% MATLAB code for Example 10.11
t = linspace(-1,1,1000);
R = (1/3)*cos(2*pi*t(:)).*cos(2*pi*t);
imagesc(t,t,R);


# Python code for Example 10.11
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(-1,1,1000)
R = (1/3)*np.outer(np.cos(2*np.pi*t), np.cos(2*np.pi*t))
plt.imshow(R, extent=[-1, 1, -1, 1])
plt.show()


To understand the 2D function shown on the right-hand side of Figure 10.11, we can
take a closer look by drawing Figure 10.12. For any two time instants t_1 and t_2, we have
two random variables X(t_1) and X(t_2). The joint expectation E[X(t_1)X(t_2)] will return us
some value, and this is a point in the 2D plot R_X(t_1, t_2). The value tells us the correlation
between X(t_1) and X(t_2). In the example in which t_1 = 0 and t_2 = 0.5, the correlation is
−1/3. Interestingly, if we pick another pair of time instants t_1 = −0.5 and t_2 = 0, the joint
expectation is E[X(−0.5)X(0)] = −1/3, which is the same value. However, this −1/3 is located
at a different valley than E[X(0)X(0.5)] in the 2D plot.


Figure 10.12: To understand the autocorrelation function, pick two time instants t_1 and t_2, and then
evaluate the joint expectation E[X(t_1)X(t_2)].


The above example shows a periodic autocorrelation function. The periodicity is specific
to this example: it arises because every realization of the random process X(t) is a
periodic function. In general, an
arbitrary random process can have an arbitrary autocorrelation function that is not periodic.
There are, of course, various properties of the autocorrelation functions and special types
of autocorrelation functions. We will study one of them, called the wide-sense stationary
processes, later.
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Example 10.12. Let 9 ~Uniform|—7, 7], X(t) = cos(wt + ©). Draw the autocorrela- 
tion function Rx (t1, t2). 


Solution. From the previous example we know that

$$R_X(t_1,t_2) = \frac{1}{2}\cos\left(\omega(t_1 - t_2)\right).$$

Figure 10.13 shows the realizations, and the mean and autocorrelation functions.

Note that the autocorrelation function has a structure: Every row is a shifted
version of the previous row. We call this a Toeplitz structure. An autocorrelation with
a Toeplitz structure is specified once we know any of the rows. A Toeplitz structure also
implies that the autocorrelation function does not depend on the pair $(t_1, t_2)$ but only
on the difference $t_1 - t_2$. In other words, $R_X(0,1)$ is the same as $R_X(11.6, 12.6)$, and so
knowing $R_X(0,1)$ is enough to know all $R_X(t_0, t_0+1)$. Not all random processes have
a Toeplitz autocorrelation function. Random processes with a Toeplitz autocorrelation
function are “nice” processes that we will study in detail later.
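The Toeplitz structure described above can be tested directly on a sampled grid. The following is a minimal sketch, assuming $\omega = 2\pi$ and a uniform time grid; the helper `is_toeplitz` is a hypothetical name introduced here. It also contrasts the result with the autocorrelation $\frac{1}{3}\cos(2\pi t_1)\cos(2\pi t_2)$ from the earlier example, which is not Toeplitz.

```python
import numpy as np

omega = 2 * np.pi                       # assumed angular frequency
t = np.linspace(-1, 1, 201)             # uniform time grid

# R[i, j] = R_X(t_i, t_j) = 0.5 * cos(omega * (t_i - t_j))
R = 0.5 * np.cos(omega * (t[:, None] - t[None, :]))

# The earlier example: (1/3) cos(2 pi t1) cos(2 pi t2)
R_non = (1 / 3) * np.outer(np.cos(2 * np.pi * t), np.cos(2 * np.pi * t))

def is_toeplitz(M, tol=1e-9):
    """Return True if every diagonal of the square matrix M is constant (up to tol)."""
    n = M.shape[0]
    for k in range(-n + 1, n):
        d = np.diagonal(M, k)
        if d.max() - d.min() > tol:
            return False
    return True

print(is_toeplitz(R), is_toeplitz(R_non))
```

A constant diagonal is exactly the statement that $R_X(t_1,t_2)$ depends only on $t_1 - t_2$, so the check is a discrete version of the property discussed in the example.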


Figure 10.13: The autocorrelation function $R_X(t_1,t_2) = \frac{1}{2}\cos\left(\omega(t_1 - t_2)\right)$.


The MATLAB code used to generate Figure 10.13 is shown below. 


% MATLAB code for Example 10.12
t = linspace(-1,1,1000);
R = toeplitz(0.5*cos(2*pi*t(:)));   % first column defines every diagonal
imagesc(t,t,R);
grid on;
xticks(-1:0.25:1);
yticks(-1:0.25:1);


Practice Exercise 10.1. Let $\Theta \sim$ Uniform$[0, 2\pi]$, $X(t) = \cos(\omega t + \Theta)$. Find the PDF
of $X(0)$.


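Solution. Let $Z = X(0) = \cos\Theta$. Then, for $-1 < z < 1$, the CDF of $Z$ is

$$F_Z(z) = P[Z \le z] = P[\cos\Theta \le z] = P[\cos^{-1} z \le \Theta \le 2\pi - \cos^{-1} z] = 1 - \frac{\cos^{-1} z}{\pi}.$$

Then by the fundamental theorem of calculus,

$$f_Z(z) = \frac{d}{dz}F_Z(z) = \frac{1}{\pi\sqrt{1-z^2}}, \qquad -1 < z < 1.$$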
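As a sanity check on the derivation, the CDF $1 - \cos^{-1}(z)/\pi$ can be compared against an empirical CDF of simulated samples of $X(0) = \cos\Theta$. This is only a sketch: the seed, the sample size, and the evaluation grid are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2 * np.pi, size=500_000)   # Theta ~ Uniform[0, 2*pi]
Z = np.cos(theta)                                   # samples of X(0)

z_grid = np.linspace(-0.9, 0.9, 19)
F_emp = np.array([np.mean(Z <= z) for z in z_grid])  # empirical CDF on the grid
F_thy = 1 - np.arccos(z_grid) / np.pi                # CDF from the derivation

max_cdf_err = np.max(np.abs(F_emp - F_thy))
print(max_cdf_err)    # expected to be small (sampling error only)
```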


A similar concept to the autocorrelation function is the autocovariance function. The
idea is to remove the mean before computing the correlation. This is analogous to the
covariance $\mathrm{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)]$ as opposed to the correlation $E[XY]$ in the
random variable case.


Definition 10.3. The autocovariance function of a random process $X(t)$ is

$$C_X(t_1,t_2) = E\left[(X(t_1) - \mu_X(t_1))(X(t_2) - \mu_X(t_2))\right]. \qquad (10.7)$$


As one might expect, the autocovariance function is closely related to the autocorrelation 
function. 


Theorem 10.1.

$$C_X(t_1,t_2) = R_X(t_1,t_2) - \mu_X(t_1)\mu_X(t_2).$$


Proof. Plugging in the definition, we have that

$$C_X(t_1,t_2) = E\left[X(t_1)X(t_2) - X(t_1)\mu_X(t_2) - X(t_2)\mu_X(t_1) + \mu_X(t_1)\mu_X(t_2)\right]$$
$$= R_X(t_1,t_2) - \mu_X(t_1)\mu_X(t_2) - \mu_X(t_1)\mu_X(t_2) + \mu_X(t_1)\mu_X(t_2)$$
$$= R_X(t_1,t_2) - \mu_X(t_1)\mu_X(t_2).$$


Practice Exercise 10.2. If $X(t) = A\cos(2\pi t)$ for $A \sim$ Uniform$[0, 1]$, find $C_X(t_1,t_2)$.

Solution.

$$C_X(t_1,t_2) = \frac{1}{3}\cos(2\pi t_1)\cos(2\pi t_2) - \frac{1}{2}\cos(2\pi t_1)\cdot\frac{1}{2}\cos(2\pi t_2)$$
$$= \frac{1}{12}\cos(2\pi t_1)\cos(2\pi t_2).$$
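The result can be cross-checked by simulation, using Theorem 10.1 to turn estimated means and correlations into an estimated autocovariance. The sketch below is illustrative only; the function names `C_hat` and `C_formula`, the seed, and the test times are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.uniform(0.0, 1.0, size=400_000)      # A ~ Uniform[0, 1]

def C_hat(t1, t2):
    """Estimate C_X(t1, t2) as R_X(t1, t2) - mu_X(t1) mu_X(t2) (Theorem 10.1)."""
    x1 = A * np.cos(2 * np.pi * t1)
    x2 = A * np.cos(2 * np.pi * t2)
    return np.mean(x1 * x2) - np.mean(x1) * np.mean(x2)

def C_formula(t1, t2):
    """Closed form from the exercise: (1/12) cos(2 pi t1) cos(2 pi t2)."""
    return np.cos(2 * np.pi * t1) * np.cos(2 * np.pi * t2) / 12

t1, t2 = 0.1, 0.35
print(C_hat(t1, t2), C_formula(t1, t2))
```

The factor $\frac{1}{12}$ is simply the variance of a Uniform$[0,1]$ random variable, which is what remains once the mean is removed.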


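Practice Exercise 10.3. Suppose $X(t) = \cos(\omega t + \Theta)$ for $\Theta \sim$ Uniform$[-\pi, \pi]$. Find
$C_X(t_1,t_2)$.


Solution.

$$C_X(t_1,t_2) = R_X(t_1,t_2) - \mu_X(t_1)\mu_X(t_2) = \frac{1}{2}\cos\left(\omega(t_1 - t_2)\right) - 0\cdot 0 = \frac{1}{2}\cos\left(\omega(t_1 - t_2)\right).$$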


In some problems we are interested in the correlation between two random processes 
X(t) and Y(t). This gives us the cross-correlation and the cross-covariance functions. 


Definition 10.4. The cross-correlation function of $X(t)$ and $Y(t)$ is

$$R_{X,Y}(t_1,t_2) = E[X(t_1)Y(t_2)].$$


Definition 10.5. The cross-covariance function of $X(t)$ and $Y(t)$ is

$$C_{X,Y}(t_1,t_2) = E\left[(X(t_1) - \mu_X(t_1))(Y(t_2) - \mu_Y(t_2))\right].$$


Remark. If $\mu_X(t_1) = \mu_Y(t_2) = 0$, then $C_{X,Y}(t_1,t_2) = R_{X,Y}(t_1,t_2) = E[X(t_1)Y(t_2)]$.


10.2.3 Independent processes


How do we establish independence for two random processes? We know that for two random
variables to be independent, the joint PDF can be written as a product of two PDFs:

$$f_{X,Y}(x,y) = f_X(x)f_Y(y). \qquad (10.11)$$

If we extrapolate this idea to random processes, a natural formulation would be

$$f_{X(t),Y(t)}(x,y) = f_{X(t)}(x)f_{Y(t)}(y). \qquad (10.12)$$


But this definition has a problem because $X(t)$ and $Y(t)$ are functions. It is not enough to
just look at one time index, say $t = t_0$. The way to think about this situation is to consider
a pair of random vectors $\mathbf{X}$ and $\mathbf{Y}$. When you say $\mathbf{X}$ and $\mathbf{Y}$ are independent, you require
$f_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y}) = f_{\mathbf{X}}(\mathbf{x})f_{\mathbf{Y}}(\mathbf{y})$. The PDF $f_{\mathbf{X}}(\mathbf{x})$ itself is a joint distribution, i.e., $f_{\mathbf{X}}(\mathbf{x}) =
f_{X_1,\ldots,X_N}(x_1,\ldots,x_N)$. Therefore, for random processes, we need something similar.

Definition 10.6. Two random processes $X(t)$ and $Y(t)$ are independent if for any
$t_1, \ldots, t_N$,

$$f_{X(t_1),\ldots,X(t_N),Y(t_1),\ldots,Y(t_N)}(x_1,\ldots,x_N,y_1,\ldots,y_N)
= f_{X(t_1),\ldots,X(t_N)}(x_1,\ldots,x_N) \times f_{Y(t_1),\ldots,Y(t_N)}(y_1,\ldots,y_N).$$


This definition is reminiscent of $f_{X,Y}(x,y) = f_X(x)f_Y(y)$. The requirement here is that
the factorization holds for any $N$, including very small $N$ and very large $N$, because $X(t)$
and $Y(t)$ are infinitely long.


Independence means that the behavior of one process will not influence the behavior
of the other process. A weaker notion is uncorrelatedness, which we define as follows.


Definition 10.7. Two random processes $X(t)$ and $Y(t)$ are uncorrelated if

$$E[X(t_1)Y(t_2)] = E[X(t_1)]\,E[Y(t_2)]. \qquad (10.13)$$


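Independence implies uncorrelation, as we can see from the following. If $X(t)$ and $Y(t)$ are
independent, it follows that

$$E[X(t_1)Y(t_2)] = \int_{\Omega}\int_{\Omega} X(t_1,\xi)\,Y(t_2,\zeta)\,f_{X,Y}(\xi,\zeta)\,d\xi\,d\zeta$$
$$= \int_{\Omega}\int_{\Omega} X(t_1,\xi)\,Y(t_2,\zeta)\,f_X(\xi)f_Y(\zeta)\,d\xi\,d\zeta \qquad \text{(independence)}$$
$$= \int_{\Omega} X(t_1,\xi)f_X(\xi)\,d\xi \int_{\Omega} Y(t_2,\zeta)f_Y(\zeta)\,d\zeta = E[X(t_1)]\,E[Y(t_2)].$$

If two random processes are uncorrelated, they are not necessarily independent.

Independent $X$ and $Y$ $\Rightarrow$ uncorrelated $X$ and $Y$, but uncorrelated $X$ and $Y$ $\nRightarrow$ independent $X$ and $Y$.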
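A small simulation can illustrate why the converse fails. The example below is an assumption made purely for illustration (it is not taken from the text): let $A \sim$ Gaussian$(0,1)$ and define the two constant-in-time processes $X(t) = A$ and $Y(t) = A^2$. They satisfy (10.13) because $E[A^3] = 0 = E[A]E[A^2]$, yet $Y(t)$ is a deterministic function of $X(t)$.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal(1_000_000)

X = A         # X(t) = A for every t
Y = A ** 2    # Y(t) = A^2 for every t, fully determined by X

# Uncorrelated: E[X(t1) Y(t2)] = E[A^3] = 0 and E[X] E[Y] = 0 * 1 = 0
lhs = np.mean(X * Y)
rhs = np.mean(X) * np.mean(Y)

# Not independent: E[X^2 Y] = E[A^4] = 3, but E[X^2] E[Y] = 1 * 1 = 1
dep_lhs = np.mean(X ** 2 * Y)
dep_rhs = np.mean(X ** 2) * np.mean(Y)

print(lhs, rhs, dep_lhs, dep_rhs)
```

If the processes were independent, every pair of functions of $X$ and $Y$ would factorize in expectation; the second comparison shows one pair that does not.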


Example 10.13. Let $Y(t) = X(t) + N(t)$, where $X(t)$ and $N(t)$ are independent.
Then

$$R_{X,Y}(t_1,t_2) = E[X(t_1)Y(t_2)] = E[X(t_1)(X(t_2) + N(t_2))]$$
$$= R_X(t_1,t_2) + \mu_X(t_1)\mu_N(t_2).$$


10.3 Wide-Sense Stationary Processes


As we have seen in the previous sections, some random processes have a “nice” autocorrelation
function, in the sense that the 2D function $R_X(t_1,t_2)$ has a Toeplitz structure.
Random processes with this property are known as wide-sense stationary (WSS) processes.
WSS processes belong to a very small subset in the entire universe of random processes,
but they are practically the most useful ones. Before we discuss how to use them, we first
present a formal definition of a WSS process.


10.3.1 Definition of a WSS process 


Definition 10.8. A random process $X(t)$ is wide-sense stationary if:

1. $\mu_X(t)$ = constant, for all $t$, and
2. $R_X(t_1,t_2) = R_X(t_1 - t_2)$ for all $t_1, t_2$.


There are two criteria that define a WSS process. The first criterion is that the mean is a
constant. That is, the mean function does not change with time. The second criterion is that
the autocorrelation function only depends on the difference $t_1 - t_2$ and not on the absolute
starting point. For example, $R_X(0.1, 1.1)$ needs to be the same as $R_X(6.3, 7.3)$, because the
intervals are both 1.

How can these two criteria be mapped to the Toeplitz structure we discussed in the
previous examples? Figure 10.14 shows the autocorrelation function $R_X(t_1,t_2)$, which is a
2D function. We take three cross sections corresponding to $t_2 = -0.13$, $t_2 = 0$ and $t_2 = 0.3$.
As you can see from the figure, each cross section of $R_X(t_1,t_2)$ is a shifted version of the others. To obtain
any value $R_X(t_1,t_2)$ on the function, there is no need to probe the 2D map; you only
need to probe the red curve and locate the position marked as $t_1 - t_2$, and you will be
able to obtain the value $R_X(t_1,t_2)$.


Figure 10.14: Cross sections of the autocorrelation function $R_X(t_1,t_2) = \frac{1}{2}\cos\left(\omega(t_1 - t_2)\right)$.


Not all random processes have a Toeplitz autocorrelation function. For example, the
random process $X(t) = A\cos(2\pi t)$ is not a WSS process, because the autocorrelation function is

$$R_X(t_1,t_2) = \frac{1}{3}\cos(2\pi t_1)\cos(2\pi t_2),$$

which cannot be written as a function of the difference $t_1 - t_2$.

Footnote 1: Many textbooks introduce strictly stationary processes before discussing a wide-sense stationary process.
We skip the former because, throughout our book, we only use WSS processes. Readers interested in strictly
stationary processes can consult the references listed at the end of this chapter.


Remark 1. WSS processes can also be defined using the autocovariance function instead
of the autocorrelation function, because if a process is WSS, then the mean function is a
constant. If the mean function is a constant, then $C_X(t_1,t_2) = R_X(t_1,t_2) - \mu_X^2$. So any
geometric structure that $R_X$ possesses will be translated to $C_X$, as the constant $\mu_X^2$ will not
influence the geometry. Therefore, it is equally valid to say that a WSS process has

$$C_X(t_1,t_2) = C_X(t_1 - t_2).$$


Remark 2. Because the autocorrelation of a WSS process is completely characterized by the difference $t_1 - t_2$, there is
no need to keep track of the absolute indices $t_1$ and $t_2$. We can rewrite the autocorrelation
function as

$$R_X(\tau) = E[X(t+\tau)X(t)]. \qquad (10.14)$$

There is nothing new in this equation: It only says that instead of writing $R_X(t+\tau,t)$, we
can write $R_X(\tau)$ because the time index $t$ plays no role in terms of $R_X$. Thus from now on,
for any WSS processes we will write the autocorrelation function as $R_X(\tau)$.


10.3.2 Properties of $R_X(\tau)$

When $X(t)$ is WSS, $R_X(\tau)$ has several important properties.


Corollary 10.1. $R_X(0)$ = average power of $X(t)$.


Proof. Since

$$R_X(0) = E[X(t+0)X(t)] = E[X(t)^2],$$

and since $E[X(t)^2]$ is the average power, $R_X(0)$ is the average power of $X(t)$.


Corollary 10.2. $R_X(\tau)$ is symmetric. That is, $R_X(\tau) = R_X(-\tau)$.


Proof. Note that $R_X(\tau) = E[X(t+\tau)X(t)]$. By switching the order of multiplication in the
expectation, we have

$$E[X(t+\tau)X(t)] = E[X(t)X(t+\tau)] = R_X(-\tau).$$


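Corollary 10.3.

$$P\left(|X(t+\tau) - X(t)| > \epsilon\right) \le \frac{2\left(R_X(0) - R_X(\tau)\right)}{\epsilon^2}.$$

This result says that if $R_X(\tau)$ is slowly decaying from $R_X(0)$, the probability of having a
large deviation $|X(t+\tau) - X(t)|$ is small.


Proof. By Chebyshev's (Markov's) inequality applied to the squared difference,

$$P\left(|X(t+\tau) - X(t)| > \epsilon\right) \le \frac{E\left[(X(t+\tau) - X(t))^2\right]}{\epsilon^2}
= \frac{E[X(t+\tau)^2] - 2E[X(t+\tau)X(t)] + E[X(t)^2]}{\epsilon^2}
= \frac{2\left(R_X(0) - R_X(\tau)\right)}{\epsilon^2}.$$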

Corollary 10.4. $|R_X(\tau)| \le R_X(0)$, for all $\tau$.


Proof. By the Cauchy-Schwarz inequality $E[XY]^2 \le E[X^2]E[Y^2]$, we can show that

$$R_X(\tau)^2 = E[X(t)X(t+\tau)]^2 \le E[X(t)^2]\,E[X(t+\tau)^2] = E[X(t)^2]^2 = R_X(0)^2.$$
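These properties can be observed on a simulated process. The sketch below uses a discrete-time stand-in that is an assumption of this illustration, not an example from the text: $X[n] = W[n] + 0.5\,W[n-1]$ with i.i.d. Gaussian $W[n]$, for which $R_X[0] = 1.25$, $R_X[\pm 1] = 0.5$, and $R_X[k] = 0$ otherwise. The helper `R_hat` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal(1_000_001)
X = W[1:] + 0.5 * W[:-1]                 # length 1_000_000

def R_hat(k):
    """Time-average estimate of R_X[k] = E[X[n+k] X[n]] for a lag k >= 0."""
    return np.mean(X[k:] * X[:len(X) - k])

R0, R1 = R_hat(0), R_hat(1)              # Corollary 10.1: R0 is the average power

# Corollary 10.3 with tau = 1 and eps = 2
eps = 2.0
p_emp = np.mean(np.abs(X[1:] - X[:-1]) > eps)   # empirical probability of a large jump
bound = 2 * (R0 - R1) / eps ** 2

# Corollary 10.4: |R_X[k]| <= R_X[0]
bounded = all(abs(R_hat(k)) <= R0 for k in range(1, 6))

print(R0, R1, p_emp, bound, bounded)
```

The bound of Corollary 10.3 is loose here, which is typical of Chebyshev-type inequalities: it holds for every WSS process regardless of distribution.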


10.3.3 Physical interpretation of $R_X(\tau)$


How should we understand the autocorrelation function $R_X(\tau)$ for WSS processes? Certainly,
by definition, $R_X(\tau) = E[X(t+\tau)X(t)]$ means that we can analyze $R_X(\tau)$ from the
statistical perspective. But in this section we want to take a slightly different approach by
answering the question from a computational perspective.

Consider the following function:

$$\widehat{R}_X(\tau) = \frac{1}{2T}\int_{-T}^{T} X(t+\tau)X(t)\,dt. \qquad (10.15)$$

This function is the temporal average of $X(t+\tau)X(t)$, as opposed to the statistical average.
Why do we want to consider this temporal average? We first show the main result, that
$E[\widehat{R}_X(\tau)] = R_X(\tau)$.


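Lemma 10.1. Let $\widehat{R}_X(\tau) = \frac{1}{2T}\int_{-T}^{T} X(t+\tau)X(t)\,dt$. Then

$$E\left[\widehat{R}_X(\tau)\right] = R_X(\tau).$$


Proof.

$$E\left[\widehat{R}_X(\tau)\right] = \frac{1}{2T}\int_{-T}^{T} E[X(t+\tau)X(t)]\,dt
= \frac{1}{2T}\int_{-T}^{T} R_X(\tau)\,dt = R_X(\tau)\cdot\frac{1}{2T}\int_{-T}^{T} dt = R_X(\tau).$$

This lemma implies that if the signal $X(t)$ is long enough, we can approximate $R_X(\tau)$
by $\widehat{R}_X(\tau)$. The approximation is correct on average (unbiased), in the sense that $E[\widehat{R}_X(\tau)] =
R_X(\tau)$. Now, the more interesting question is the interpretation of $\widehat{R}_X(\tau)$. What is it?


How should we understand $\widehat{R}_X(\tau)$?


$\widehat{R}_X(\tau)$ is the “unflipped convolution”, or correlation, of $X(t)$ and $X(t+\tau)$.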


Correlation is analogous to convolution. For convolution, the definition is

$$Y(\tau) = \int_{-T}^{T} X(\tau - t)X(t)\,dt, \qquad (10.17)$$

whereas for correlation, the definition is

$$Y(\tau) = \int_{-T}^{T} X(t+\tau)X(t)\,dt. \qquad (10.18)$$

Clearly, $\widehat{R}_X(\tau)$ is the latter. A graphical illustration of the difference between convolution
and correlation is provided in Figure 10.15. The only difference between the two is that the
correlation does not flip the function, whereas the convolution does flip the function.


(a) Convolution (b) Correlation

Figure 10.15: The difference between convolution and correlation. In convolution, the function $X(t)$ is
flipped before we compute the result. For correlation, the function is not flipped.


The temporal correlation is easy to visualize. Starting with the function $X(t+\tau)$, if you
make $\tau$ larger or smaller, then effectively you are shifting $X(t)$ left or right. The integration
$\int_{-T}^{T} X(t+\tau)X(t)\,dt$ calculates the energy accumulated. If the integral is large, there is a
strong correlation between $X(t)$ and $X(t+\tau)$. Otherwise the correlation is small. Here is
an extreme example:

Example 10.14. Consider a random process $X(t)$ such that for every $t$, $X(t)$ is an
i.i.d. Gaussian random variable with zero mean and unit variance. Then

$$R_X(\tau) = E[X(t+\tau)X(t)] = \begin{cases} E[X^2(t)], & \tau = 0,\\ E[X(t+\tau)]\,E[X(t)], & \tau \neq 0.\end{cases}$$

Using the fact that $X(t)$ is i.i.d. Gaussian for all $t$, we can show that $E[X^2(t)] = 1$ for
any $t$, and $E[X(t+\tau)]E[X(t)] = 0$. Therefore, we have

$$R_X(\tau) = \begin{cases} 1, & \tau = 0,\\ 0, & \tau \neq 0.\end{cases}$$

The equation says that since the random process is i.i.d. Gaussian, shifting and integrating
will give maximum correlation at the origin. As soon as the shift is not at
the origin, the correlation is zero. This makes sense because the samples are just i.i.d.
Gaussian. An offset of one sample is enough to destroy any correlation.

Now let’s calculate the temporal correlation. We know that

$$\widehat{R}_X(\tau) = \frac{1}{2T}\int_{-T}^{T} X(t+\tau)X(t)\,dt.$$

This equation says that we shift $X(t)$ to the left and right and then integrate. If $\tau$
is not zero, the product $X(t+\tau)X(t)$ will sometimes be positive and sometimes be
negative. After integrating the entire period, we cancel out most of the terms. Let’s
plot the functions and see if all these steps make sense. In Figure 10.16(a), we show
two random realizations of the random process $X(t)$. They are just i.i.d. Gaussian
samples.

In Figure 10.16(b) we plot the temporal autocorrelation function $\widehat{R}_X(\tau)$. Since
$\widehat{R}_X(\tau)$ itself is a random process, it has different realizations. We plot two random
realizations, which are computed based on shifting and integrating $X(t)$. In the same
plot, we also show the statistical expectation $R_X(\tau)$. As we can see from the plot,
the temporal correlation and the statistical correlation match reasonably well except
for the fluctuation in $\widehat{R}_X(\tau)$, which is expected because it is computed from a finite
number of samples.

Figure 10.16: (a) A random process $X(t)$ with two different realizations. (b) As we calculate the
temporal correlation of each of the two realizations, we obtain a noisy function that is nearly an
impulse. If we take the average of many of these realizations, we obtain a pure delta function.


On a computer, the commands to do the autocorrelation function are xcorr in MATLAB
and np.correlate in Python. Below are the codes used to generate Figure 10.16.


% MATLAB code to demonstrate autocorrelation
N = 1000; % number of sample paths
T = 1000; % number of time stamps
X = 1*randn(N,T);
xc = zeros(N,2*T-1);
for i=1:N
    xc(i,:) = xcorr(X(i,:))/T;
end
plot(xc(1,:),'b:', 'LineWidth', 2); hold on;
plot(xc(2,:),'k:', 'LineWidth', 2);


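# Python code to demonstrate autocorrelation
import numpy as np
import matplotlib.pyplot as plt

N = 1000   # number of sample paths
T = 1000   # number of time stamps
X = np.random.randn(N,T)
xc = np.zeros((N,2*T-1))
for i in range(N):
    xc[i,:] = np.correlate(X[i,:],X[i,:],mode='full')/T   # temporal correlation of path i
plt.plot(xc[0,:],'b:')
plt.plot(xc[1,:],'k:')
plt.show()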
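The caption of Figure 10.16 mentions averaging many realizations. A hedged sketch of that step, using smaller (arbitrarily chosen) values of N and T than the listing above so that it runs quickly, and omitting the plot:

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 500, 200                          # illustrative sizes, smaller than in the listing
X = rng.standard_normal((N, T))

# Temporal correlation of each realization; each row has length 2T - 1
xc = np.array([np.correlate(x, x, mode="full") / T for x in X])

xc_mean = xc.mean(axis=0)                # average over the N realizations
center = T - 1                           # index of zero lag in 'full' mode

peak = xc_mean[center]
off_peak = np.max(np.abs(np.delete(xc_mean, center)))
single_off_peak = np.max(np.abs(np.delete(xc[0], center)))
print(peak, off_peak, single_off_peak)
```

The averaged curve is close to 1 at zero lag and close to 0 elsewhere, while a single realization still shows visible fluctuation away from the origin.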


Under what conditions will $\widehat{R}_X(\tau) \to R_X(\tau)$ as $T \to \infty$? The answer to this question
is provided by an important theorem called the Mean-Square Ergodic Theorem, which can be
thought of as the random process version of the weak law of large numbers. We leave the
discussion of the mean-square ergodic theorem to the Appendix.

Everything you need to know about a WSS process

• The mean of a WSS process is a constant (does not need to be zero).

• The correlation function only depends on the difference, so $R_X(t_1,t_2)$ is Toeplitz.

• You can write $R_X(t_1,t_2)$ as $R_X(\tau)$, where $\tau = t_1 - t_2$.

• $R_X(\tau)$ tells you how much correlation you have with someone located at a time
instant $\tau$ from you.


10.4 Power Spectral Density 


Beginning with this section we are going to focus on WSS processes. By WSS, we mean that
the autocorrelation function $R_X(t_1,t_2)$ has a Toeplitz structure. Putting it in other words,
we assume $R_X(t_1,t_2)$ can be simplified to $R_X(\tau)$, where $\tau = t_1 - t_2$. We call this property
time invariance.

10.4.1 Basic concepts 

Assuming that $R_X(\tau)$ is square integrable, i.e., $\int_{-\infty}^{\infty} R_X(\tau)^2\,d\tau < \infty$, we can now define the
Fourier transform of $R_X(\tau)$, which is called the power spectral density.


Theorem 10.2 (Einstein-Wiener-Khinchin Theorem). The power spectral density
$S_X(\omega)$ of a WSS process is

$$S_X(\omega) = \int_{-\infty}^{\infty} R_X(\tau)\,e^{-j\omega\tau}\,d\tau = \mathcal{F}(R_X(\tau)),$$

assuming that $\int_{-\infty}^{\infty} R_X(\tau)^2\,d\tau < \infty$ so that the Fourier transform of $R_X(\tau)$ exists.


Practice Exercise 10.4. Let $R_X(\tau) = e^{-2\alpha|\tau|}$. Find $S_X(\omega)$.


Solution. Using the Fourier transform table,

$$S_X(\omega) = \mathcal{F}\{R_X(\tau)\} = \frac{4\alpha}{4\alpha^2 + \omega^2}.$$

Figure 10.17 shows the autocorrelation function and the power spectral density.

Figure 10.17: Example for $R_X(\tau) = e^{-2\alpha|\tau|}$, with $\alpha = 1$.
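The table lookup can be confirmed by evaluating the Fourier integral numerically. The following sketch approximates the integral by a Riemann sum on a truncated interval; the truncation at $|\tau| \le 20$, the grid spacing, and the function names are assumptions chosen for illustration.

```python
import numpy as np

alpha = 1.0
tau = np.linspace(-20, 20, 400_001)      # exp(-2|tau|) is negligible beyond |tau| = 20
dtau = tau[1] - tau[0]
R = np.exp(-2 * alpha * np.abs(tau))     # R_X(tau)

def S_numeric(w):
    """Riemann-sum approximation of the Fourier integral of R_X at frequency w."""
    # R_X is even, so the imaginary part vanishes; keep the real part only.
    return np.sum(R * np.exp(-1j * w * tau)).real * dtau

def S_closed(w):
    """Closed form from the exercise: 4*alpha / (4*alpha^2 + w^2)."""
    return 4 * alpha / (4 * alpha ** 2 + w ** 2)

for w in (0.0, 1.0, 3.0):
    print(w, S_numeric(w), S_closed(w))
```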


Why is Theorem 10.2 a theorem rather than a definition? This is because the power spectral
density has its own definition. There is no way that you can get any “power” information merely
by looking at the Fourier transform of $R_X(\tau)$. We will discuss the origin of the power spectral
density later, but for now, we only need to know that $S_X(\omega)$ is the Fourier transform of
$R_X(\tau)$.


Remark. The power spectral density is defined for WSS processes. If the process is not
WSS, then $R_X$ will be a 2D function instead of a 1D function of $\tau$, so we cannot take the
Fourier transform in $\tau$. We will discuss this in detail shortly.


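Practice Exercise 10.5. Let $X(t) = a\cos(\omega_0 t + \Theta)$, $\Theta \sim$ Uniform$[0, 2\pi]$. Find $S_X(\omega)$.


Solution. We know that the autocorrelation function is

$$R_X(\tau) = \frac{a^2}{2}\cos(\omega_0\tau) = \frac{a^2}{2}\left(\frac{e^{j\omega_0\tau} + e^{-j\omega_0\tau}}{2}\right).$$

By taking the Fourier transform of both sides, we have

$$S_X(\omega) = \frac{a^2}{2}\left[\frac{2\pi\delta(\omega - \omega_0) + 2\pi\delta(\omega + \omega_0)}{2}\right]
= \frac{\pi a^2}{2}\left[\delta(\omega - \omega_0) + \delta(\omega + \omega_0)\right].$$

The result is shown in Figure 10.18.

Figure 10.18: Example for $R_X(\tau) = \frac{a^2}{2}\cos(\omega_0\tau)$, with $a = 1$ and $\omega_0 = 2\pi$.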


Practice Exercise 10.6. Let $S_X(\omega) = \frac{N_0}{2}\,\mathrm{rect}\left(\frac{\omega}{2W}\right)$. Find $R_X(\tau)$.


Solution. Since $S_X(\omega) = \mathcal{F}(R_X(\tau))$, the inverse holds:

$$R_X(\tau) = \frac{N_0 W}{2\pi}\,\mathrm{sinc}(W\tau), \qquad \text{where } \mathrm{sinc}(x) = \frac{\sin x}{x}.$$

This example shows what we call the bandlimited white noise. The power spectral
density $S_X(\omega)$ is uniform over the band $|\omega| \le W$, meaning that it covers all frequencies
in the band equally (or wavelengths in optics). It is called “white noise” because white light
is essentially a mixture of all wavelengths.

The bandwidth $W$ of the power spectral density defines the zero crossings of
$R_X(\tau)$. It is easy to show that when $W \to \infty$, $R_X(\tau)$ converges to a delta function.
This happens when $X(t)$ is i.i.d. Gaussian. Therefore, the pure Gaussian noise random
process is also known as the white noise process. Reshaping the i.i.d. Gaussian noise
to an arbitrary power spectral density can be done by passing it through a linear filter,
as we will explain later.
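The inverse transform can also be evaluated numerically. The sketch below assumes the convention $\mathrm{rect}(x) = 1$ for $|x| \le 1/2$, so that the band is $|\omega| \le W$, and uses $N_0 = 2$, $W = 5$. Note that NumPy's `np.sinc(x)` is the normalized $\sin(\pi x)/(\pi x)$, so the unnormalized $\sin(W\tau)/(W\tau)$ is written as `np.sinc(W * tau / np.pi)`. The function names are hypothetical.

```python
import numpy as np

N0, W = 2.0, 5.0
w = np.linspace(-W, W, 200_001)          # frequencies inside the band
dw = w[1] - w[0]

def R_numeric(tau):
    """Trapezoid-rule value of (1/2pi) * integral of (N0/2) e^{j w tau} over [-W, W]."""
    f = (N0 / 2) * np.exp(1j * w * tau)
    integral = (np.sum(f) - 0.5 * (f[0] + f[-1])) * dw
    return integral.real / (2 * np.pi)

def R_closed(tau):
    """(N0 W / 2pi) * sin(W tau) / (W tau), written with NumPy's normalized sinc."""
    return N0 * W / (2 * np.pi) * np.sinc(W * tau / np.pi)

for tau in (0.0, 0.3, np.pi / W):        # the last value is the first zero crossing
    print(tau, R_numeric(tau), R_closed(tau))
```

The first zero crossing sits at $\tau = \pi/W$, so a larger bandwidth $W$ gives a narrower $R_X(\tau)$, consistent with the limit toward a delta function.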


Figure 10.19: Example for $S_X(\omega) = \frac{N_0}{2}\,\mathrm{rect}\left(\frac{\omega}{2W}\right)$, with $N_0 = 2$ and $W = 5$.


Finding $S_X(\omega)$ from $R_X(\tau)$ is straightforward, at least in principle. The more interesting
questions to ask are: (1) Why do we need to learn about power spectral density? (2)
Why do we need WSS to define power spectral density?


How is power spectral density useful?

• Power spectral densities are useful when we pass a random process through some
linear operations, e.g., convolution, running average, or running difference.

• Power spectral densities are the Fourier transforms of the autocorrelation functions.
Fourier transforms are useful for speeding up computation and drawing
random samples from a given power spectral density.


A random process itself is not interesting until we process it; there are many ways to do
this. The most basic operation is to send the random process through a linear time-invariant
system, e.g., a convolution. Convolution is equivalent to filtering the random process. For
example, if the input process contains noise, we can design a linear time-invariant filter to
denoise the random process. The power spectral density, which is the Fourier transform
of the autocorrelation function, makes the filtering easier because everything can be done
in the spectral (Fourier) domain. Moreover, we can analyze the performance and quantify
the limit using standard results in Fourier analysis. For some specialized problems such as
imaging through atmospheric turbulence, the distortions happen in the phase domain. This
can be simulated by drawing samples from the power spectral density, e.g., the Kolmogorov
spectrum or the von Kármán spectrum. Power spectral densities have many important
engineering applications.


Why does the power spectral density require wide-sense stationarity?

• If a process is WSS, then $R_X$ will have a Toeplitz structure.

• A Toeplitz matrix is important. If you do eigendecomposition to a Toeplitz matrix,
the eigenvectors are the Fourier bases.

• So if $R_X$ is Toeplitz, then you can diagonalize it using the Fourier transform.

• Therefore, the power spectral density can be defined.


Why does power spectral density require WSS? This has to do with the Toeplitz
structure of the autocorrelation function. To make our discussion easier let us discretize
the autocorrelation function $R_X(t_1,t_2)$ by considering $R_X[m,n]$. (You can do a mental
calculation by converting $t_1$ to integer indices $m$, and $t_2$ to $n$. See any textbook on signals
and systems if you need help. This is called the “discrete time signal”.) Following the range
of $t_1$ and $t_2$, $R_X[m,n]$ can be expressed as:

$$R = \begin{bmatrix}
R_X[0] & R_X[1] & \cdots & R_X[N-1]\\
R_X[1] & R_X[0] & \cdots & R_X[N-2]\\
\vdots & \vdots & \ddots & \vdots\\
R_X[N-1] & R_X[N-2] & \cdots & R_X[0]
\end{bmatrix},$$

where we used the fact that $R_X[m,n] = R_X[m-n]$ for WSS processes and $R_X[k] = R_X[-k]$.
We call the resulting matrix $R$ the autocorrelation matrix, which is a discretized version
of the autocorrelation function $R_X(t_1,t_2)$. Looking at $R$, we again observe the Toeplitz
structure. For example, Figure 10.20 shows one Toeplitz structure and one non-Toeplitz
structure.

Any Toeplitz matrix $R$ can be diagonalized using the Fourier transforms. That is, we
can write $R$ as

$$R = F^H \Lambda F,$$

where $F$ is the (discrete) Fourier transform matrix and $\Lambda$ is a diagonal matrix. This can be
understood as the eigendecomposition of $R$. (Strictly speaking, the factorization is exact when
the Toeplitz matrix wraps around, i.e., is circulant, and it holds approximately for large Toeplitz
matrices.) The important point here is that only Toeplitz
matrices can be eigendecomposed using the Fourier transforms; an arbitrary symmetric
matrix cannot. Figure 10.20 illustrates this point. If your matrix is Toeplitz, you can diagonalize
it, and hence you can define the power spectral density, just as in the first example. If
your matrix is not Toeplitz, then the power spectral density is undefined. To get the Toeplitz
matrix, you must start with a WSS process.
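The diagonalization can be demonstrated on a small matrix. The sketch below makes two assumptions for the sake of an exact check: the autocorrelation is sampled over one full period so that the Toeplitz matrix is also circulant, and the DFT matrix is scaled to be unitary. The variable names are illustrative. For contrast, the same transform is applied to a symmetric matrix that is not Toeplitz.

```python
import numpy as np

N = 8
k = np.arange(N)
r = 0.5 * np.cos(2 * np.pi * k / N)      # one period of R_X[k]; note r[k] == r[N - k]

# Circulant (wrap-around Toeplitz) autocorrelation matrix: R[m, n] = r[(m - n) mod N]
R = np.array([[r[(m - n) % N] for n in range(N)] for m in range(N)])

F = np.fft.fft(np.eye(N)) / np.sqrt(N)   # unitary DFT matrix

Lam = F @ R @ F.conj().T                 # equals Lambda if R = F^H Lambda F
off_diag = np.max(np.abs(Lam - np.diag(np.diag(Lam))))

# A symmetric but non-Toeplitz matrix: outer product of a cosine with itself
v = np.cos(2 * np.pi * k / N)
Lam_non = F @ np.outer(v, v) @ F.conj().T
off_diag_non = np.max(np.abs(Lam_non - np.diag(np.diag(Lam_non))))

print(off_diag, off_diag_non)
print(np.round(np.diag(Lam).real, 6))    # diagonal entries: the DFT of r
```

In this setting the diagonal of $\Lambda$ coincides with the DFT of the first column of $R$, which is the discrete counterpart of taking the Fourier transform of $R_X(\tau)$.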

Before moving on, we define cross power spectral densities, which will be useful in
some applications.

Figure 10.20: We show two autocorrelation functions $R_X[m,n]$ on the left-hand side. The first autocorrelation
function comes from a WSS process that has a Toeplitz structure. The second autocorrelation
function does not have Toeplitz structure. For the Toeplitz matrix, we can diagonalize it using the
Fourier transform. The eigenvalues are the power spectral density.


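Definition 10.9. The cross power spectral density between two random processes
$X(t)$ and $Y(t)$ is

$$S_{X,Y}(\omega) = \mathcal{F}(R_{X,Y}(\tau)) \quad \text{where} \quad R_{X,Y}(\tau) = E[X(t+\tau)Y(t)],$$
$$S_{Y,X}(\omega) = \mathcal{F}(R_{Y,X}(\tau)) \quad \text{where} \quad R_{Y,X}(\tau) = E[Y(t+\tau)X(t)]. \qquad (10.19)$$


Remark. In general, $S_{X,Y}(\omega) \neq S_{Y,X}(\omega)$. Rather, since $R_{X,Y}(\tau) = R_{Y,X}(-\tau)$, we have
$S_{X,Y}(\omega) = \overline{S_{Y,X}(\omega)}$, where the bar denotes the complex conjugate.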


10.4.2 Origin of the power spectral density 


To understand the power spectral density, it is crucial to understand where it comes from
and why it is the Fourier transform of the autocorrelation function.

We begin by assuming that $X(t)$ is a WSS random process with mean $\mu_X$ and autocorrelation
$R_X(\tau)$. We now consider the notion of power. The power of $X(t)$ within a period $[-T,T]$ is

$$\widehat{P}_X = \frac{1}{2T}\int_{-T}^{T} |X(t)|^2\,dt.$$

$\widehat{P}_X$ defines the power because the integration alone is the energy, and the normalization by
$1/2T$ gives us the power. However, there are two problems. First, since $X(t)$ is random, the
power $\widehat{P}_X$ is also random. Is there a way we can eliminate the randomness? Second, $T$ is
a finite period of time. It does not capture the entire process, and so we do not know the
power of the entire process.

A natural solution to these two problems is to consider

$$P_X \overset{\text{def}}{=} E\left[\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} |X(t)|^2\,dt\right]. \qquad (10.20)$$

Here, we take the limit of $T$ to infinity so that we can compute the power of the entire
process. We also take the expectation to eliminate the randomness. Therefore, $P_X$ can be
regarded as the average power of the complete random process $X(t)$.

Next, we need one definition and one lemma. The definition defines $S_X(\omega)$, and the
lemma will link $S_X(\omega)$ with the power $P_X$.


Definition 10.10. The power spectral density (PSD) of a WSS process is defined as

$$S_X(\omega) = \lim_{T\to\infty}\frac{E\left[|\widetilde{X}_T(\omega)|^2\right]}{2T}, \qquad (10.21)$$

where

$$\widetilde{X}_T(\omega) = \int_{-T}^{T} X(t)\,e^{-j\omega t}\,dt \qquad (10.22)$$

is the Fourier transform of $X(t)$ limited to $[-T,T]$.


This definition is abstract, but in a nutshell, it simply considers everything in the Fourier
domain. The ratio $|\widetilde{X}_T(\omega)|^2/2T$ is the power, but in the frequency domain. The reason is
that if $X(t)$ is Fourier transformable, then Parseval’s theorem will hold. Parseval’s theorem
states that energy in the original space is conserved in the Fourier space. Since the ratio
$|\widetilde{X}_T(\omega)|^2/2T$ is the energy divided by time, it is the power. However, this is still not enough
to help us understand power spectral density: We need a lemma.


Lemma 10.2. Define $S_X(\omega)$ as in (10.21). Then

$$P_X = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_X(\omega)\,d\omega.$$


The lemma has to be read together with the previous definition. If we can prove the lemma,
we know that by integrating $S_X(\omega)$ we will obtain the power. Therefore, $S_X(\omega)$ can be
viewed as a density function, specifically the density function of the power. $S_X(\omega)$ is called
the power spectral density because everything is defined in the Fourier domain. Putting this
all together gives us “power spectral density”.


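Proof. First, we recall that $P_X$ is the expectation of the average power of $X(t)$. Let

$$X_T(t) = \begin{cases} X(t), & -T \le t \le T,\\ 0, & \text{otherwise.}\end{cases}$$

It follows that integrating over $-\infty$ to $\infty$ is equivalent to

$$\int_{-\infty}^{\infty} |X_T(t)|^2\,dt = \int_{-T}^{T} |X(t)|^2\,dt.$$

By Parseval’s theorem, energy is conserved in both the time and the frequency domain:

$$\int_{-\infty}^{\infty} |X_T(t)|^2\,dt = \frac{1}{2\pi}\int_{-\infty}^{\infty} |\widetilde{X}_T(\omega)|^2\,d\omega.$$

Therefore, assuming that the limit, the expectation, and the integral can be interchanged, $P_X$ satisfies

$$P_X = E\left[\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} |X(t)|^2\,dt\right]
= E\left[\lim_{T\to\infty}\frac{1}{2T}\cdot\frac{1}{2\pi}\int_{-\infty}^{\infty} |\widetilde{X}_T(\omega)|^2\,d\omega\right]$$
$$= \frac{1}{2\pi}\int_{-\infty}^{\infty} \underbrace{\lim_{T\to\infty}\frac{E\left[|\widetilde{X}_T(\omega)|^2\right]}{2T}}_{S_X(\omega)}\,d\omega
= \frac{1}{2\pi}\int_{-\infty}^{\infty} S_X(\omega)\,d\omega.$$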


The power spectral densities are functions whose integrations give us the power. If we
want to determine the power of a random process, the Einstein-Wiener-Khinchin theorem
(Theorem 10.2) says that $S_X(\omega)$ is just the Fourier transform of $R_X(\tau)$:

$$S_X(\omega) = \int_{-\infty}^{\infty} R_X(\tau)\,e^{-j\omega\tau}\,d\tau = \mathcal{F}(R_X(\tau)).$$
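The agreement between the two descriptions of $S_X(\omega)$ can be observed numerically in discrete time. The sketch below is an illustration under stated assumptions, not the setting of the theorem itself: it uses the process $X[n] = W[n] + 0.5\,W[n-1]$ with i.i.d. Gaussian $W[n]$, for which $R_X[0] = 1.25$ and $R_X[\pm 1] = 0.5$, so the Fourier transform of the autocorrelation is $1.25 + \cos\omega$. The discrete analogue of (10.21) divides by the number of samples $T$ instead of the interval length $2T$.

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 2000, 256                          # realizations and samples per realization
Wn = rng.standard_normal((N, T + 1))
X = Wn[:, 1:] + 0.5 * Wn[:, :-1]          # each row is one realization

# Discrete analogue of Definition 10.10: average of |X_T(w)|^2 / T over realizations
periodogram = np.mean(np.abs(np.fft.fft(X, axis=1)) ** 2, axis=0) / T

# Fourier transform of the autocorrelation, evaluated at the FFT frequencies
w = 2 * np.pi * np.arange(T) / T
S_theory = 1.25 + np.cos(w)

max_err = np.max(np.abs(periodogram - S_theory))
avg_power = np.mean(periodogram)          # discrete analogue of (1/2pi) * integral of S_X
print(max_err, avg_power)
```

The average of the estimated spectrum over all frequencies returns approximately $R_X[0] = 1.25$, which mirrors Lemma 10.2: integrating the power spectral density gives the power.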


The proof of the Einstein-Wiener-Khinchin theorem is quite intricate, so we defer
the proof to the Appendix. The significance of the theorem is that it turns an abstract
quantity, the power spectral density, into a very easily computable quantity, namely the
Fourier transform of the autocorrelation function. For now, we will happily use this theorem
because it saves us a great deal of trouble when we want to determine the power spectral
density from first principles.

10.5 WSS Process through LTI Systems


Random processes have limited usefulness until we can apply operations to them. In this 
section we discuss how WSS processes respond to a linear time-invariant (LTI) system. This 
technique is most useful in signal processing, communication, speech analysis, and imaging. 
We will be brief here since you can find most of this information in any standard textbook 
on signals and systems. 


10.5.1 Review of linear time-invariant systems 


When we say a “system”, we mean that there exists an input-output relationship as shown 
in Figure 10.21. 


X(t) → [ System ] → Y(t)


Figure 10.21: A system can be viewed as a black box that takes an input X(t) and turns it into an 
output Y(t). 


Linear time-invariant (LTI) systems are the simplest systems we use in engineering
problems. An LTI system has two properties.

• Linearity. Linearity means that when two input random processes are added and
scaled, the output random processes will also be added and scaled in exactly the
same way. Mathematically, linearity says that if $X_1(t) \to Y_1(t)$ and $X_2(t) \to
Y_2(t)$, then for any constants $a$ and $b$,

$$aX_1(t) + bX_2(t) \to aY_1(t) + bY_2(t).$$

• Time invariance. Time invariance means that if we shift the input random process
by a certain time period, the output will be shifted in the same way. Mathematically,
time invariance means that if $X(t) \to Y(t)$, then

$$X(t+\tau) \to Y(t+\tau).$$

If a system is linear time-invariant, the input-to-output relation is given by convolution:

The convolution between two functions $X(t)$ and $h(t)$ is defined as

$$Y(t) = h(t) * X(t) = \int_{-\infty}^{\infty} h(\tau)\,X(t-\tau)\,d\tau,$$

in which we call $h(t)$ the system response or impulse response.

The function $h(t)$ is called the impulse response because if $X(t) = \delta(t)$, then according to
the convolution equation we have

$$Y(t) = \int_{-\infty}^{\infty} h(\tau)\,\delta(t-\tau)\,d\tau = h(t).$$

Therefore, if we send an impulse to the system, the output will be $h(t)$.
Convolution is commutative, meaning that $h(t) * X(t) = X(t) * h(t)$. Written as integrations,
we have

$$\int_{-\infty}^{\infty} h(\tau)\,X(t-\tau)\,d\tau = \int_{-\infty}^{\infty} h(t-\tau)\,X(\tau)\,d\tau. \qquad (10.24)$$

For LTI systems, $Y(t)$ can be determined through the Fourier transforms.


The Fourier transform of a (square-integrable) function $X(t)$ is

$$X(\omega) = \mathcal{F}\{X(t)\} = \int_{-\infty}^{\infty} X(t)\,e^{-j\omega t}\,dt.$$


A basic property of convolution is that convolution in the time domain is equivalent to
multiplication in the Fourier domain. Therefore

$$Y(\omega) = H(\omega)X(\omega), \qquad (10.26)$$

where $H(\omega) = \mathcal{F}\{h(t)\}$ is the Fourier transform of $h(t)$, and $Y(\omega) = \mathcal{F}\{Y(t)\}$ is the Fourier
transform of $Y(t)$.
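The facts reviewed above (impulse response, commutativity, and the convolution theorem) have direct discrete-time counterparts that can be checked in a few lines. In this sketch the FIR impulse response `h` and the input `x` are hypothetical, and the FFTs are zero-padded to the full output length so that the circular convolution computed by the FFT matches the linear convolution.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.standard_normal(64)                  # an arbitrary input signal
h = np.array([0.5, 0.3, 0.2, -0.1])          # hypothetical impulse response

# Time domain: y = h * x (linear convolution), length 64 + 4 - 1
y_time = np.convolve(h, x)

# Fourier domain: Y(w) = H(w) X(w), with zero-padding to the output length
L = len(y_time)
y_freq = np.fft.ifft(np.fft.fft(h, L) * np.fft.fft(x, L)).real

# Sending an impulse through the system returns the impulse response
impulse_out = np.convolve(h, np.array([1.0]))

print(np.max(np.abs(y_time - y_freq)))
```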

In the rest of this section we study the pair of input and output random processes that
are defined as follows.

• $X(t)$ = input. It is a WSS random process.

• $Y(t)$ = output. It is constructed by sending $X(t)$ through an LTI system with
impulse response $h(t)$. Therefore, $Y(t) = h(t) * X(t)$.


10.5.2 Mean and autocorrelation through LTI Systems 


Since $X(t)$ is WSS, the mean function of $X(t)$ stays constant, i.e., $\mu_X(t) = \mu_X$. The following
theorem gives the mean function of the output.


Theorem 10.3. If $X(t)$ passes through an LTI system to yield $Y(t)$, the mean function
of $Y(t)$ is

$$E[Y(t)] = \mu_X \int_{-\infty}^{\infty} h(\tau)\,d\tau. \qquad (10.27)$$

Proof. Suppose that $Y(t) = h(t) * X(t)$. Then,

$$\mu_Y(t) = E[Y(t)] = E\left[\int_{-\infty}^{\infty} h(\tau)\,X(t-\tau)\,d\tau\right]$$
$$= \int_{-\infty}^{\infty} h(\tau)\,E[X(t-\tau)]\,d\tau = \int_{-\infty}^{\infty} h(\tau)\,\mu_X\,d\tau = \mu_X\int_{-\infty}^{\infty} h(\tau)\,d\tau,$$

where the second to last equality is valid because $E[X(t-\tau)] = \mu_X$.

The theorem suggests that if the input X(t) has a constant mean, the output Y(¢) 
should also have a constant mean. This should not be a surprise because if the system is 
linear, a constant input will give a constant output. 
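The statement of Theorem 10.3 can be checked with a short discrete-time simulation. The sketch below is only an illustration under assumed values: the mean `mu_x` and the impulse response `h` are hypothetical, and the integral in Equation (10.27) is replaced by a sum because the signal is sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_x = 2.0                             # constant mean of the WSS input (assumed value)
h = np.array([0.5, 0.3, 0.2, 0.1])     # hypothetical impulse response

# White Gaussian input with mean mu_x; 'valid' drops the edge samples
x = mu_x + rng.standard_normal(200_000)
y = np.convolve(x, h, mode="valid")

mu_y_empirical = y.mean()
mu_y_theory = mu_x * h.sum()           # discrete analogue of Equation (10.27)
print(mu_y_empirical, mu_y_theory)
```

The empirical mean of the output should sit close to the input mean scaled by the sum of the impulse response, regardless of the time index at which it is measured.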


Example 10.15. Consider a WSS random process X(t) such that each sample is an
i.i.d. Gaussian random variable with zero mean and unit variance. We send this process
through an LTI system with impulse response h(t), where

$$h(t) = \begin{cases} 10(1-|t|), & -1 \le t \le 1, \\ 0, & \text{otherwise}. \end{cases}$$

The mean function of X(t) is $\mu_X(t) = 0$, and that of Y(t) is $\mu_Y(t) = 0$. Figure 10.22
illustrates a numerical example, in which we see that the random processes X(t) and
Y(t) have different shapes but the mean functions remain constant.


(a) $\mu_X(t)$ and $\mu_Y(t)$    (b) $R_X(\tau)$ and $R_Y(\tau)$


Figure 10.22: When sending a WSS random process through an LTI system, the mean and the 
autocorrelation functions are changed. 


Next, we derive the autocorrelation function of a random process when sent through 
an LTI system. 


Theorem 10.4. If X(t) passes through an LTI system to yield Y(t), the autocorrelation
function of Y(t) is

$$R_Y(\tau) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(s)\,h(r)\,R_X(\tau + s - r)\,ds\,dr. \tag{10.28}$$
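Proof. We start with the definition of Y(t):

$$\begin{aligned}
R_Y(\tau) &= \mathbb{E}[Y(t)\,Y(t+\tau)] \\
&= \mathbb{E}\left[\int_{-\infty}^{\infty} h(s)\,X(t-s)\,ds \int_{-\infty}^{\infty} h(r)\,X(t+\tau-r)\,dr\right] \\
&\overset{(a)}{=} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(s)\,h(r)\,\mathbb{E}[X(t-s)\,X(t+\tau-r)]\,ds\,dr \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(s)\,h(r)\,R_X(\tau+s-r)\,ds\,dr,
\end{aligned}$$

where in (a) we assume that integration and expectation are interchangeable.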


A shorthand notation of the above formula is $R_Y(\tau) = [h \circledast (h * R_X)](\tau)$, where $*$ denotes
the convolution and $\circledast$ denotes the correlation. Figure 10.22(b) shows the autocorrelation
functions $R_X$ and $R_Y$. In this example $R_X$ is a delta function because for i.i.d. Gaussian
noise the power spectral density is a constant. After convolving with the system response,
the autocorrelation $R_Y$ has a different shape.


10.5.3 Power spectral density through LTI systems


Denoting the Fourier transform of the impulse response by $H(\omega) = \mathcal{F}\{h(t)\}$, we derive the
power spectral density of the output.


Theorem 10.5. If X(t) passes through an LTI system to yield Y(t), the power spectral
density of Y(t) is

$$S_Y(\omega) = |H(\omega)|^2\,S_X(\omega). \tag{10.29}$$


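Proof. By definition, the power spectral density $S_Y(\omega)$ is the Fourier transform of the
autocorrelation function $R_Y(\tau)$. Therefore,

$$\begin{aligned}
S_Y(\omega) &= \int_{-\infty}^{\infty} R_Y(\tau)\,e^{-j\omega\tau}\,d\tau \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(s)\,h(r)\,R_X(\tau+s-r)\,ds\,dr\,e^{-j\omega\tau}\,d\tau.
\end{aligned}$$

Letting $u = \tau + s - r$, we have

$$\begin{aligned}
S_Y(\omega) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(s)\,h(r)\,R_X(u)\,e^{-j\omega(u-s+r)}\,ds\,dr\,du \\
&= \int_{-\infty}^{\infty} h(s)\,e^{j\omega s}\,ds \int_{-\infty}^{\infty} h(r)\,e^{-j\omega r}\,dr \int_{-\infty}^{\infty} R_X(u)\,e^{-j\omega u}\,du \\
&= \overline{H(\omega)}\,H(\omega)\,S_X(\omega),
\end{aligned}$$

where $\overline{H(\omega)}$ is the complex conjugate of $H(\omega)$.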


It is tempting to think that since Y(t) = h(t) * X(t), the power spectral density should
also be $S_Y(\omega) = H(\omega)S_X(\omega)$, but this is not true. The above result shows that we need an
additional complex conjugate $\overline{H(\omega)}$ because $S_Y(\omega)$ is the power, which means the square
of the signal. Note that $R_X$ is “squared” because it is the correlation of X with itself, and $R_Y$
is also squared. Therefore, to match $R_X$ and $R_Y$, the impulse response h also needs to be
squared in the Fourier domain.
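A discrete-time sanity check of Theorem 10.4 and Theorem 10.5 can be sketched as follows. Everything concrete here is an assumption made for illustration: the impulse response `h`, the autocorrelation `Rx`, and the helper `dtft` are hypothetical and do not come from the text. The code builds $R_Y$ as the correlation of h with the convolution of h and $R_X$, and then compares its spectrum with $|H|^2 S_X$.

```python
import numpy as np

# Hypothetical discrete-time stand-ins (not from the text)
h = np.array([1.0, -0.5, 0.25])             # impulse response at lags 0, 1, 2
Rx = np.array([0.2, 0.5, 1.0, 0.5, 0.2])    # input autocorrelation at lags -2..2

# R_Y = h (correlation) (h (convolution) R_X); for real h, correlating
# with h is the same as convolving with the time-reversed h.
Ry = np.convolve(h[::-1], np.convolve(h, Rx))   # lags -4..4

def dtft(seq, lags, w):
    """Evaluate sum_k seq[k] * exp(-j*w*lags[k]) on the frequency grid w."""
    return np.exp(-1j * np.outer(w, lags)) @ seq

w = np.linspace(-np.pi, np.pi, 257)
H = dtft(h, np.arange(0, 3), w)
Sx = dtft(Rx, np.arange(-2, 3), w)
Sy = dtft(Ry, np.arange(-4, 5), w)

# Largest deviation between S_Y and |H|^2 S_X over the grid
max_gap = np.max(np.abs(Sy - np.abs(H) ** 2 * Sx))
print(max_gap)
```

The deviation is at the level of floating-point round-off, whereas comparing `Sy` with `H * Sx` (without the conjugate factor) would leave a visible gap.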


Example 10.16. A WSS process X(t) has a correlation function

$$R_X(\tau) = \mathrm{sinc}(\pi\tau).$$

Suppose that X(t) passes through an LTI system with input/output relationship

$$2\frac{d^2}{dt^2}Y(t) + 2\frac{d}{dt}Y(t) + 4Y(t) = 3\frac{d^2}{dt^2}X(t) - 3\frac{d}{dt}X(t) + 6X(t).$$

Find $R_Y(\tau)$.


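Solution: The sinc function has a Fourier transform given by

$$\mathrm{sinc}(Wt) \longleftrightarrow \frac{\pi}{W}\,\mathrm{rect}\left(\frac{\omega}{2W}\right).$$

Therefore, the autocorrelation function is

$$R_X(\tau) = \mathrm{sinc}(\pi\tau) \longleftrightarrow \frac{\pi}{\pi}\,\mathrm{rect}\left(\frac{\omega}{2\pi}\right).$$

By taking the Fourier transform on both sides, we have

$$S_X(\omega) = \begin{cases} 1, & -\pi \le \omega \le \pi, \\ 0, & \text{elsewhere}. \end{cases}$$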


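The system response is found from the differential equation:

$$H(\omega) = \frac{3(j\omega)^2 - 3(j\omega) + 6}{2(j\omega)^2 + 2(j\omega) + 4} = \frac{3\left[(2-\omega^2) - j\omega\right]}{2\left[(2-\omega^2) + j\omega\right]}.$$

Taking the magnitude square yields

$$|H(\omega)|^2 = \frac{9\left[(2-\omega^2)^2 + \omega^2\right]}{4\left[(2-\omega^2)^2 + \omega^2\right]} = \frac{9}{4}.$$

Therefore, the output power spectral density is

$$S_Y(\omega) = |H(\omega)|^2 S_X(\omega) = \frac{9}{4}S_X(\omega).$$

Taking the inverse Fourier transform, we have

$$R_Y(\tau) = \frac{9}{4}\,\mathrm{sinc}(\pi\tau).$$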
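The constant magnitude in Example 10.16 is easy to confirm numerically. The sketch below evaluates the system response $H(\omega)$ of the example on a frequency grid; the grid itself is an arbitrary choice made for this illustration.

```python
import numpy as np

w = np.linspace(-np.pi, np.pi, 1001)     # band where S_X is nonzero (grid size is arbitrary)
jw = 1j * w

# System response read off from the differential equation of Example 10.16
H = (3 * jw**2 - 3 * jw + 6) / (2 * jw**2 + 2 * jw + 4)
mag2 = np.abs(H) ** 2

print(mag2.min(), mag2.max())            # both should equal 9/4
```

Because the numerator and denominator are complex conjugates up to the factor 3/2, the system only changes the phase and scales the power by 9/4.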


Example 10.17. A random process X(t) has zero mean and $R_X(t,s) = \min(t,s)$.
Consider a new process $Y(t) = e^{t}X(e^{-2t})$.


1. Is Y(t) WSS?
2. Suppose Y(t) passes through an LTI system to yield an output Z(t) according to

$$\frac{d}{dt}Z(t) + 2Z(t) = \frac{d}{dt}Y(t) + Y(t).$$

Find $R_Z(\tau)$.


Solution: 


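1. In order to verify whether Y(t) is WSS, we need to check the mean function and
the autocorrelation function. The mean function is

$$\mathbb{E}[Y(t)] = \mathbb{E}\left[e^{t}X(e^{-2t})\right] = e^{t}\,\mathbb{E}\left[X(e^{-2t})\right].$$

Since X(t) has zero mean, $\mathbb{E}[X(t)] = 0$ for all t. This implies that if $u = e^{-2t}$,
then $\mathbb{E}[X(u)] = 0$ because u is just another time instant. Thus $\mathbb{E}[X(e^{-2t})] = 0$,
and hence $\mathbb{E}[Y(t)] = 0$.

The autocorrelation is

$$\mathbb{E}[Y(t+\tau)Y(t)] = \mathbb{E}\left[e^{t+\tau}X(e^{-2(t+\tau)})\,e^{t}X(e^{-2t})\right] = e^{2t+\tau}R_X\left(e^{-2(t+\tau)}, e^{-2t}\right).$$

Substituting $R_X(t,s) = \min(t,s)$, we have that

$$\begin{aligned}
e^{2t+\tau}R_X\left(e^{-2(t+\tau)}, e^{-2t}\right) &= e^{2t+\tau}\min\left(e^{-2(t+\tau)}, e^{-2t}\right) \\
&= \begin{cases} e^{2t+\tau}\,e^{-2(t+\tau)}, & \tau \ge 0, \\ e^{2t+\tau}\,e^{-2t}, & \tau < 0, \end{cases} \\
&= \begin{cases} e^{-\tau}, & \tau \ge 0, \\ e^{\tau}, & \tau < 0, \end{cases} \\
&= e^{-|\tau|}.
\end{aligned}$$

So $R_Y(\tau) = e^{-|\tau|}$. Since $R_Y(\tau)$ is a function of $\tau$, Y(t) is WSS.

2. The system response is given by

$$H(\omega) = \frac{1 + j\omega}{2 + j\omega}.$$

The magnitude is therefore

$$|H(\omega)|^2 = \frac{1+\omega^2}{4+\omega^2}.$$

Hence, the output power spectral density follows from

$$R_Y(\tau) = e^{-|\tau|} \longleftrightarrow S_Y(\omega) = \frac{2}{1+\omega^2},$$

$$S_Z(\omega) = |H(\omega)|^2 S_Y(\omega) = \frac{1+\omega^2}{4+\omega^2}\cdot\frac{2}{1+\omega^2} = \frac{2}{4+\omega^2}.$$

Therefore

$$R_Z(\tau) = \frac{1}{2}e^{-2|\tau|}.$$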
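Both parts of Example 10.17 can be checked numerically. The sketch below is an illustration only: the grids for t, $\tau$ and $\omega$ are arbitrary choices, and the inverse Fourier transform is approximated by a Riemann sum over a truncated frequency range, so the last comparison holds only up to a small truncation error.

```python
import numpy as np

# Part 1: e^{2t+tau} * min(e^{-2(t+tau)}, e^{-2t}) should equal e^{-|tau|} for every t
t = np.linspace(-1.0, 1.0, 21)[:, None]
tau = np.linspace(-3.0, 3.0, 61)[None, :]
Ry = np.exp(2 * t + tau) * np.minimum(np.exp(-2 * (t + tau)), np.exp(-2 * t))
part1_ok = np.allclose(Ry, np.exp(-np.abs(tau)))

# Part 2: H(w) = (1 + jw)/(2 + jw) applied to S_Y(w) = 2/(1 + w^2)
dw = 0.01
w = np.arange(-2000.0, 2000.0 + dw, dw)
H = (1 + 1j * w) / (2 + 1j * w)
Sz = np.abs(H) ** 2 * 2 / (1 + w ** 2)
part2_ok = np.allclose(Sz, 2 / (4 + w ** 2))

# Inverse Fourier transform by a Riemann sum, compared with (1/2) e^{-2|tau|}
taus = np.array([0.0, 0.5, 1.0])
Rz_num = np.array([np.sum(Sz * np.cos(w * s)) * dw / (2 * np.pi) for s in taus])
Rz_ref = 0.5 * np.exp(-2 * np.abs(taus))
print(part1_ok, part2_ok, Rz_num, Rz_ref)
```

The first check confirms that the autocorrelation of Y(t) does not depend on t, which is the reason Y(t) is WSS even though X(t) is not.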


10.5.4 Cross-correlation through LTI Systems


The above analyses are developed for the autocorrelation function. If we consider the cross- 
correlation between two random processes, say X(t) and Y(t), then the above results do not 
hold. In this section, we discuss the cross-correlation through LTI systems. 

To begin with, we need to define WSS for a pair of random processes. 


Definition 10.11. Two random processes X(t) and Y(t) are jointly WSS if
1. X(t) is WSS and Y(t) is WSS, and

2. $R_{X,Y}(t_1, t_2) = \mathbb{E}[X(t_1)Y(t_2)]$ is a function of $t_1 - t_2$.

If X(t) and Y(t) are jointly WSS, we write

$$R_{X,Y}(t_1, t_2) = R_{X,Y}(\tau) = \mathbb{E}[X(t+\tau)Y(t)].$$


The definition of “jointly WSS” is necessary here because $R_{X,Y}$ is defined by X and Y. Just
knowing that X(t) and Y(t) are WSS does not allow one to say that $R_{X,Y}(t_1, t_2)$ can be
written as a function of the time difference.

If we flip the order of X and Y to consider $R_{Y,X}(\tau)$ and not $R_{X,Y}(\tau)$, then we need
to flip the argument. The following lemma explains why.
Lemma 10.3. For any random processes X(t) and Y(t), the cross-correlation $R_{X,Y}(\tau)$
is related to $R_{Y,X}(\tau)$ as

$$R_{X,Y}(\tau) = R_{Y,X}(-\tau). \tag{10.30}$$


Proof. Recall the definition of $R_{Y,X}(-\tau) = \mathbb{E}[Y(t-\tau)X(t)]$. This can be simplified as
follows:

$$R_{Y,X}(-\tau) = \mathbb{E}[Y(t-\tau)X(t)] = \mathbb{E}[X(t)Y(t-\tau)] = \mathbb{E}[X(t'+\tau)Y(t')] = R_{X,Y}(\tau),$$

where we substituted $t' = t - \tau$.


Example 10.18. Let X(t) and N(t) be two independent WSS random processes with
expectations $\mathbb{E}[X(t)] = \mu_X$ and $\mathbb{E}[N(t)] = 0$, respectively. Let Y(t) = X(t) + N(t). We
want to show that X(t) and Y(t) are jointly WSS, and we want to find $R_{X,Y}(\tau)$.


Solution. Before we show the joint WSS property of X(t) and Y(t), we first show
that Y(t) is WSS:

$$\begin{aligned}
\mathbb{E}[Y(t)] &= \mathbb{E}[X(t) + N(t)] = \mu_X, \\
R_Y(t_1, t_2) &= \mathbb{E}[(X(t_1) + N(t_1))(X(t_2) + N(t_2))] \\
&= \mathbb{E}[X(t_1)X(t_2)] + \mathbb{E}[N(t_1)N(t_2)] \\
&= R_X(t_1 - t_2) + R_N(t_1 - t_2).
\end{aligned}$$

Thus, Y(t) is WSS.
To show that X(t) and Y(t) are jointly WSS, we need to check the cross-correlation
function:

$$\begin{aligned}
R_{X,Y}(t_1, t_2) &= \mathbb{E}[X(t_1)Y(t_2)] \\
&= \mathbb{E}[X(t_1)(X(t_2) + N(t_2))] \\
&= \mathbb{E}[X(t_1)X(t_2)] + \mathbb{E}[X(t_1)N(t_2)] \\
&= R_X(t_1, t_2) + \mathbb{E}[X(t_1)]\,\mathbb{E}[N(t_2)] \\
&= R_X(t_1, t_2).
\end{aligned}$$

Since $R_{X,Y}(t_1, t_2)$ is a function of $t_1 - t_2$, and since X(t) and Y(t) are WSS, X(t) and
Y(t) must be jointly WSS.


Finally, to find $R_{X,Y}(\tau)$, we substitute $\tau = t_1 - t_2$ and obtain $R_{X,Y}(\tau) = R_X(\tau)$.


Knowing the definition of jointly WSS, we consider the cross-correlation between X(t)
and Y(t). Note that here we are asking about the cross-correlation between the input and
the output of the same LTI system, as illustrated in Figure 10.23. The pair X(t) and
Y(t) = h(t) * X(t) are special because Y(t) is the convolved version of X(t).

[Block diagram: X(t) → System → Y(t), annotated with $R_X(\tau)$, $R_{X,Y}(\tau)$, $R_{Y,X}(\tau)$ and $R_Y(\tau)$]


Figure 10.23: The source of the signals when defining $R_X(\tau)$, $R_{X,Y}(\tau)$, $R_{Y,X}(\tau)$ and $R_Y(\tau)$.


Theorem 10.6. Let X(t) and Y(t) be jointly WSS processes, and let Y(t) = h(t) * X(t).
Then the cross-correlation $R_{Y,X}(\tau)$ is

$$R_{Y,X}(\tau) = h(\tau) * R_X(\tau). \tag{10.31}$$


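Proof. Recalling the definition of cross-correlation, we have

$$\begin{aligned}
R_{Y,X}(\tau) &= \mathbb{E}[Y(t+\tau)X(t)] \\
&= \mathbb{E}\left[X(t)\int_{-\infty}^{\infty} X(t+\tau-r)\,h(r)\,dr\right] \\
&= \int_{-\infty}^{\infty} \mathbb{E}[X(t)X(t+\tau-r)]\,h(r)\,dr = \int_{-\infty}^{\infty} R_X(\tau-r)\,h(r)\,dr,
\end{aligned}$$

which is the convolution $R_{Y,X}(\tau) = h(\tau) * R_X(\tau)$.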
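A Monte Carlo sketch can illustrate Theorem 10.6 in discrete time. The values below are assumptions for illustration: the impulse response `h` is hypothetical, the input is white so that its autocorrelation is a unit impulse, and the helper `cross_corr` is a simple time-average estimator written for this example. Under these assumptions the convolution of h with $R_X$ is h itself.

```python
import numpy as np

rng = np.random.default_rng(1)

h = np.array([1.0, 0.6, 0.3])      # hypothetical causal impulse response
N = 200_000
x = rng.standard_normal(N)         # white input: R_X[k] is a unit impulse at k = 0
y = np.convolve(x, h)[:N]          # y[n] = sum_k h[k] x[n-k]

def cross_corr(a, b, k):
    """Estimate E[a[n+k] b[n]] by a time average, for k >= 0."""
    return np.mean(a[k:] * b[:len(b) - k])

R_yx = np.array([cross_corr(y, x, k) for k in range(5)])

# For a white input, h * R_X = h, so R_YX[k] should be close to h[k]
# and close to zero beyond the length of the filter.
expected = np.array([1.0, 0.6, 0.3, 0.0, 0.0])
print(R_yx)
```

This is also the idea behind identifying an unknown system by probing it with white noise: the input-output cross-correlation traces out the impulse response.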


We next define the cross power spectral density of two jointly WSS processes as the 
Fourier transform of the cross-correlation function. 


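Definition 10.12. The cross power spectral density of two jointly WSS processes
X(t) and Y(t) is defined as

$$S_{X,Y}(\omega) = \mathcal{F}[R_{X,Y}(\tau)], \qquad S_{Y,X}(\omega) = \mathcal{F}[R_{Y,X}(\tau)].$$

The relationship between $S_{X,Y}$ and $S_{Y,X}$ can be seen from the following theorem.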


Theorem 10.7. For two jointly WSS random processes X(t) and Y(t), the cross
power spectral density satisfies the property that

$$S_{X,Y}(\omega) = \overline{S_{Y,X}(\omega)}, \tag{10.32}$$

where $\overline{(\cdot)}$ denotes the complex conjugate.
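Proof. Since $S_{X,Y}(\omega) = \mathcal{F}[R_{X,Y}(\tau)]$ by definition, it follows that

$$\begin{aligned}
\mathcal{F}[R_{X,Y}(\tau)] &= \int_{-\infty}^{\infty} R_{X,Y}(\tau)\,e^{-j\omega\tau}\,d\tau \\
&= \int_{-\infty}^{\infty} R_{Y,X}(-\tau)\,e^{-j\omega\tau}\,d\tau = \int_{-\infty}^{\infty} R_{Y,X}(\tau')\,e^{j\omega\tau'}\,d\tau',
\end{aligned}$$

which is exactly the conjugate $\overline{S_{Y,X}(\omega)}$.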


When sending the random process through an LTI system, the cross power
spectral density is given by the theorem below.


Theorem 10.8. If X(t) passes through an LTI system to yield Y(t), then the cross
power spectral density is

$$S_{Y,X}(\omega) = H(\omega)\,S_X(\omega), \qquad S_{X,Y}(\omega) = \overline{H(\omega)}\,S_X(\omega).$$


Proof. By taking the Fourier transform on $R_{Y,X}(\tau)$ we have that $S_{Y,X}(\omega) = H(\omega)S_X(\omega)$.
Since $R_{X,Y}(\tau) = R_{Y,X}(-\tau)$, it holds that $S_{X,Y}(\omega) = \overline{H(\omega)}S_X(\omega)$.


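Example 10.19. Let X(t) be a WSS random process with

$$R_X(\tau) = e^{-\tau^2/2}, \qquad H(\omega) = e^{-\omega^2/2}.$$

Find $S_{X,Y}(\omega)$, $R_{X,Y}(\tau)$, $S_Y(\omega)$ and $R_Y(\tau)$.


Solution. First, by the Fourier transform table we know that

$$S_X(\omega) = \sqrt{2\pi}\,e^{-\omega^2/2}.$$

Since $H(\omega) = e^{-\omega^2/2}$, we have

$$S_{X,Y}(\omega) = \overline{H(\omega)}\,S_X(\omega) = \sqrt{2\pi}\,e^{-\omega^2}.$$

The cross-correlation function is

$$R_{X,Y}(\tau) = \mathcal{F}^{-1}\left[\sqrt{2\pi}\,e^{-\omega^2}\right] = \frac{1}{\sqrt{2}}\,e^{-\tau^2/4}.$$

The power spectral density of Y(t) is

$$S_Y(\omega) = |H(\omega)|^2 S_X(\omega) = \sqrt{2\pi}\,e^{-3\omega^2/2}.$$

Therefore, the autocorrelation function of Y(t) is

$$R_Y(\tau) = \mathcal{F}^{-1}\left[\sqrt{2\pi}\,e^{-3\omega^2/2}\right] = \frac{1}{\sqrt{3}}\,e^{-\tau^2/6}.$$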
10.6 Optimal Linear Filter 


In the previous sections, we have built many tools to analyze random processes. Our next 
goal is to apply these techniques. To that end, we will discuss optimal linear filter design, 
which is a set of estimation techniques for predicting and recovering information from a time 
series. 


10.6.1 Discrete-time random processes 


We begin by introducing some notations. In the previous sections, we have been using 
continuous-time random processes to study statistics. In this section, we mainly focus on 
discrete-time random processes. The shift from continuous-time to discrete-time is straight- 
forward as far as the theories are concerned — we switch the continuous-time index t to 
a discrete-time index n. However, shifting to discrete-time random processes can simplify 
many difficult problems because many discrete-time problems can be solved by matrices and 
vectors. This will make the computations and implementations much easier. To make this 
transition easier, we provide a few definitions and results without proof. 


Notations for discrete-time random processes 
• We denote the discrete-time indices by m and n, corresponding to the continuous-time
indices $t_1$ and $t_2$, respectively.
• A discrete-time random process is denoted by X[n].

• Its mean function and the autocorrelation function are

$$\mu_X[n] = \mathbb{E}[X[n]], \qquad R_X[m,n] = \mathbb{E}[X[m]X[n]].$$

• We say that X[n] is WSS if $\mu_X[n]$ = constant, and $R_X[m,n]$ is a function of
m − n.
• If X[n] is WSS, we write $R_X[m,n]$ as

$$R_X[m,n] = R_X[m-n] = R_X[k],$$

where k = m − n is the interval.

• If X[n] is WSS, we define the power spectral density as

$$S_X(e^{j\omega}) = \mathcal{F}\{R_X[k]\},$$

where $\mathcal{F}\{\cdot\}$ denotes the discrete-time Fourier transform.


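When a random process X[n] is sent through an LTI system with an impulse response
h[n], the output is

$$Y[n] = h[n] * X[n] = \sum_{k=-\infty}^{\infty} h[k]\,X[n-k]. \tag{10.33}$$

When a WSS process X[n] passes through an LTI system h[n] to yield an output Y[n],
the auto- and cross-correlation functions and power spectral densities are

• $R_Y[k] = \mathbb{E}[Y[n+k]Y[n]]$, $\quad S_Y(e^{j\omega}) = \mathcal{F}\{R_Y[k]\} = |H(e^{j\omega})|^2 S_X(e^{j\omega})$.

• $R_{XY}[k] = \mathbb{E}[X[n+k]Y[n]]$, $\quad S_{XY}(e^{j\omega}) = \mathcal{F}\{R_{XY}[k]\} = \overline{H(e^{j\omega})}\,S_X(e^{j\omega})$.

• $R_{YX}[k] = \mathbb{E}[Y[n+k]X[n]]$, $\quad S_{YX}(e^{j\omega}) = \mathcal{F}\{R_{YX}[k]\} = H(e^{j\omega})\,S_X(e^{j\omega})$.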


10.6.2 Problem formulation 


The problem we study here is known as the optimal linear filter design. Suppose that there 
is a WSS process X[n] that we want to process. For example, if X[n] is a corrupted version 
of some clean time-series, we may want to remove the noise by filtering (also known as 
averaging) X[n]. Conceptualizing the denoising process as a linear time-invariant system
with an impulse response h[n], our goal is to determine the optimal h[n] such that the
estimated time series Ŷ[n] is as close to the true time series Y[n] as possible.

Referring to Figure 10.24, we refer to X[n] as the input function and to Y[n] as the
target function. X[n] and Y[n] are related according to the equation

$$Y[n] = \underbrace{\sum_{k=0}^{K-1} h[k]\,X[n-k]}_{\widehat{Y}[n]} + E[n], \tag{10.34}$$

where E[n] is a noise random process to model the error. The linear part of the equation
is known as the prediction and is constructed by sending X[n] through the system. For
simplicity we assume that X[n] is WSS. Thus, it follows that Ŷ[n] is also WSS. We may
also assume that we can estimate $R_X[k]$, $R_{YX}[k]$, $R_{XY}[k]$ and $R_Y[k]$.
[Block diagram: Input Function X[n] → Optimal Filter h[n] → Predicted Function Ŷ[n]; Target Function Y[n]]


Figure 10.24: A schematic diagram illustrating the optimal linear filter problem: Given an input function
X[n], we want to design a filter h[n] such that the prediction Ŷ[n] is close to the target function Y[n].


Example 10.20. If we let K = 3, Equation (10.34) gives us

$$Y[n] = h[0]X[n] + h[1]X[n-1] + h[2]X[n-2] + E[n].$$

That is, the current sample Y[n] is a linear combination of the current and previous
samples X[n], X[n − 1] and X[n − 2].


Given X[n] and Y[n], what would be the best guess of the impulse response h[n] so 
that the prediction is as close to the true values as possible? From our discussions of linear 
regression, we know that this is equivalent to solving the optimization problem 


$$\underset{\{h[k]\}_{k=0}^{K-1}}{\text{minimize}} \;\; \left(Y[n] - \sum_{k=0}^{K-1} h[k]\,X[n-k]\right)^2. \tag{10.35}$$


The choice of the squared error is more or less arbitrary, depending on how we want to 
model E[n]. By using the square norm, we implicitly assume that the error is Gaussian. 
This may not be true, but it is commonly used because the squared norm is differentiable. 
We will follow this tradition. 

The challenge associated with the minimization is that in most of the practical set- 
tings the random processes X[n] and Y[n] are changing rapidly because they are random 
processes. Therefore, even if we solve the optimization problem, the estimates h[k] will be 
random variables since we are solving a random equation. To eliminate this randomness, we 
take the expectation over all the possible choices of X[n] and Y[n], yielding 


$$\underset{\{h[k]\}_{k=0}^{K-1}}{\text{minimize}} \;\; \mathbb{E}_{X,Y}\left[\left(Y[n] - \sum_{k=0}^{K-1} h[k]\,X[n-k]\right)^2\right]. \tag{10.36}$$

The resulting impulse response h[k], derived by solving the above minimization, is
known as the optimal linear filter. It is the best linear model for describing the input-output
relationship between X[n] and Y[n].
What is the optimal linear filter?
The optimal linear filter is the solution to the optimization problem

$$\underset{\{h[k]\}_{k=0}^{K-1}}{\text{minimize}} \;\; \mathbb{E}_{X,Y}\left[\left(Y[n] - \sum_{k=0}^{K-1} h[k]\,X[n-k]\right)^2\right].$$


10.6.3 Yule-Walker equation


To solve the optimal linear filter problem, we first perform some (slightly tedious) algebra 
to obtain the following results: 


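Lemma 10.4. Let $\widehat{Y}[n] = \sum_{k=0}^{K-1} h[k]\,X[n-k]$ be the prediction of Y[n]. The squared-norm
error can be written as

$$\mathbb{E}_{X,Y}\left[\left(Y[n] - \widehat{Y}[n]\right)^2\right] = R_Y[0] - 2\sum_{k=0}^{K-1} h[k]\,R_{YX}[k] + \sum_{k=0}^{K-1}\sum_{j=0}^{K-1} h[k]\,h[j]\,R_X[j-k].$$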


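Proof. We expand the error as follows:

$$\mathbb{E}_{X,Y}\left[\left(Y[n] - \widehat{Y}[n]\right)^2\right] = \mathbb{E}_Y\left[(Y[n])^2\right] - 2\,\mathbb{E}_{X,Y}\left[Y[n]\widehat{Y}[n]\right] + \mathbb{E}_X\left[(\widehat{Y}[n])^2\right].$$

The first term is the autocorrelation of Y[n]:

$$\mathbb{E}_Y\left[(Y[n])^2\right] = \mathbb{E}[Y[n+0]Y[n]] = R_Y[0]. \tag{10.38}$$

The second term is

$$\begin{aligned}
\mathbb{E}_{X,Y}\left[Y[n]\widehat{Y}[n]\right] &= \mathbb{E}_{X,Y}\left[Y[n]\sum_{k=0}^{K-1} h[k]\,X[n-k]\right] \\
&= \sum_{k=0}^{K-1} h[k]\,\mathbb{E}_{X,Y}\left[Y[n]X[n-k]\right] \\
&= \sum_{k=0}^{K-1} h[k]\,R_{YX}[k].
\end{aligned} \tag{10.39}$$

The third term is

$$\begin{aligned}
\mathbb{E}_X\left[(\widehat{Y}[n])^2\right] &= \mathbb{E}_X\left[\left(\sum_{k=0}^{K-1} h[k]\,X[n-k]\right)\left(\sum_{j=0}^{K-1} h[j]\,X[n-j]\right)\right] \\
&= \sum_{k=0}^{K-1}\sum_{j=0}^{K-1} h[k]\,h[j]\,\mathbb{E}_X\left[X[n-k]X[n-j]\right] \\
&= \sum_{k=0}^{K-1}\sum_{j=0}^{K-1} h[k]\,h[j]\,R_X[j-k].
\end{aligned} \tag{10.40}$$

This completes the proof.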


The significance of this lemma is that it allows us to write the error in terms of $R_{YX}[k]$,
$R_X[k]$ and $R_Y[k]$. As we have mentioned, while we can solve the randomized optimization
Equation (10.35), the resulting solution will be a random vector depending on the particular
realizations X[n] and Y[n]. Switching from Equation (10.35) to Equation (10.36) eliminates
the randomness because we have taken the expectation. The resulting optimization according
to the lemma is also convenient. Instead of seeking individual realizations, we only need
to know the overall statistical description of the data through $R_{YX}[k]$, $R_X[k]$ and $R_Y[k]$.
These can be estimated through modeling or pseudorandom signals.

The solution to the optimal linear filter problem is summarized by the Yule-Walker 
equation: 


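Theorem 10.9. The solution {h[0], ..., h[K − 1]} to the optimal linear filter problem

$$\underset{\{h[k]\}_{k=0}^{K-1}}{\text{minimize}} \;\; \mathbb{E}_{X,Y}\left[\left(Y[n] - \widehat{Y}[n]\right)^2\right] \tag{10.41}$$

is given by the following matrix equation:

$$\begin{bmatrix} R_{YX}[0] \\ R_{YX}[1] \\ \vdots \\ R_{YX}[K-1] \end{bmatrix} =
\begin{bmatrix} R_X[0] & R_X[1] & \cdots & R_X[K-1] \\ R_X[1] & R_X[0] & \cdots & R_X[K-2] \\ \vdots & \vdots & \ddots & \vdots \\ R_X[K-1] & R_X[K-2] & \cdots & R_X[0] \end{bmatrix}
\begin{bmatrix} h[0] \\ h[1] \\ \vdots \\ h[K-1] \end{bmatrix},$$

which is known as the Yule-Walker equation.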


Therefore, by solving the simple linear problem given by the Yule-Walker equation, we will 
find the optimal linear filter solution. 


Proof. Since the error is a squared norm, the optimal solution is obtained by taking the
derivative:

$$\begin{aligned}
\frac{d}{dh[i]}\,\mathbb{E}_{X,Y}\left[\left(Y[n] - \widehat{Y}[n]\right)^2\right]
&= \frac{d}{dh[i]}\left\{R_Y[0] - 2\sum_{k=0}^{K-1} h[k]\,R_{YX}[k] + \sum_{k=0}^{K-1}\sum_{j=0}^{K-1} h[k]\,h[j]\,R_X[j-k]\right\} \\
&= 0 - 2R_{YX}[i] + 2\sum_{k=0}^{K-1} h[k]\,R_X[i-k].
\end{aligned}$$

Equating the derivative to zero yields

$$R_{YX}[i] = \sum_{k=0}^{K-1} h[k]\,R_X[i-k], \qquad i = 0, \ldots, K-1,$$

and putting the above equations into the matrix-vector form we complete the proof.


The matrix in the Yule-Walker equation is a Toeplitz matrix, in which each row is 
a shifted version of the preceding row. This matrix structure is a consequence of a WSS 
process so that the autocorrelation function is determined by the time difference k and not 
by the starting and end times. 
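To make the Yule-Walker equation concrete, the following sketch recovers a short filter from simulated data. All concrete values are assumptions for illustration: the filter `h_true`, the way the colored input is generated, the noise level, and the estimator `corr` are hypothetical choices and are not taken from the text.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
N, K = 100_000, 3
h_true = np.array([0.7, -0.2, 0.4])      # hypothetical filter to be recovered

# Colored WSS input: white noise smoothed by a two-tap moving average
x = np.convolve(rng.standard_normal(N), [1.0, 0.5], mode="same")
y = np.convolve(x, h_true)[:N] + 0.1 * rng.standard_normal(N)

def corr(a, b, k):
    """Time-average estimate of E[a[n+k] b[n]] for k >= 0."""
    return np.mean(a[k:] * b[:len(b) - k])

r_x = np.array([corr(x, x, k) for k in range(K)])    # R_X[0..K-1]
r_yx = np.array([corr(y, x, k) for k in range(K)])   # R_YX[0..K-1]

R = toeplitz(r_x)                  # symmetric Toeplitz matrix built from R_X
h_hat = np.linalg.solve(R, r_yx)   # solution of the Yule-Walker equation
print(h_hat)
```

Only K values of each correlation function are needed, so the cost of the solve depends on the filter length K and not on the number of samples N.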


Remark. If we take the derivative of the loss w.r.t. h[i], we have that

$$0 = \frac{d}{dh[i]}\,\mathbb{E}\left[\left(Y[n] - \widehat{Y}[n]\right)^2\right] = -2\,\mathbb{E}\left[\left(Y[n] - \widehat{Y}[n]\right)X[n-i]\right].$$

This condition is known as the orthogonality condition, as it says that the error Y[n] − Ŷ[n]
is orthogonal to the signal X[n − i].
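The orthogonality condition has a sample counterpart that can be verified directly. In the sketch below the expectation is replaced by an average over one simulated realization, and the filter is fitted by ordinary least squares; the data-generating filter, noise level and sizes are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 5000, 4
x = rng.standard_normal(N)
y = np.convolve(x, [0.5, 0.3, -0.2, 0.1])[:N] + 0.2 * rng.standard_normal(N)

# Column i of A holds X[n - i] for n = K-1, ..., N-1
A = np.column_stack([x[K - 1 - i : N - i] for i in range(K)])
t = y[K - 1:]

h_hat, *_ = np.linalg.lstsq(A, t, rcond=None)
err = t - A @ h_hat

# Sample version of E[(Y[n] - Yhat[n]) X[n - i]], one entry per lag i
inner = A.T @ err / len(t)
print(inner)
```

The entries of `inner` are zero up to round-off: the least-squares residual is orthogonal to every shifted copy of the input used in the prediction.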


10.6.4 Linear prediction 


We now demonstrate how to use the Yule-Walker equation in modeling an autoregressive 
process. The procedure in this simple example can be used in speech processing and time- 
series forecasting. 

Suppose that we have a WSS random process Y [n]. We would like to predict the future 
samples by using the most recent K samples through an autoregressive model. Since the 
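model is linear, we can write

$$Y[n] = \sum_{k=1}^{K} h[k]\,Y[n-k] + E[n]. \tag{10.43}$$

In this model, we say that the predicted value Ŷ[n] is a linear combination of the past
samples, up to an approximation error E[n].
The problem we need to solve is

$$\underset{h[k]}{\text{minimize}} \;\; \mathbb{E}\left[\left(Y[n] - \widehat{Y}[n]\right)^2\right].$$

Since Ŷ[n] is written in terms of the past samples of Y[n] in this problem, in the Yule-Walker
equation we can replace X with Y. Consequently, we can write the matrix equation from

$$\begin{bmatrix} R_{YX}[0] \\ R_{YX}[1] \\ \vdots \\ R_{YX}[K-1] \end{bmatrix} =
\begin{bmatrix} R_X[0] & R_X[1] & \cdots & R_X[K-1] \\ R_X[1] & R_X[0] & \cdots & R_X[K-2] \\ \vdots & \vdots & \ddots & \vdots \\ R_X[K-1] & R_X[K-2] & \cdots & R_X[0] \end{bmatrix}
\begin{bmatrix} h[0] \\ h[1] \\ \vdots \\ h[K-1] \end{bmatrix}$$

to

$$\underbrace{\begin{bmatrix} R_Y[1] \\ R_Y[2] \\ \vdots \\ R_Y[K] \end{bmatrix}}_{r} =
\underbrace{\begin{bmatrix} R_Y[0] & R_Y[1] & \cdots & R_Y[K-1] \\ R_Y[1] & R_Y[0] & \cdots & R_Y[K-2] \\ \vdots & \vdots & \ddots & \vdots \\ R_Y[K-1] & R_Y[K-2] & \cdots & R_Y[0] \end{bmatrix}}_{R}
\begin{bmatrix} h[1] \\ h[2] \\ \vdots \\ h[K] \end{bmatrix}. \tag{10.44}$$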


On a computer, solving the Yule-Walker equation requires a few steps. First, we need
to estimate the correlation

$$R_Y[k] = \mathbb{E}[Y[n+k]Y[n]] \approx \frac{1}{N}\sum_{n=1}^{N} Y[n+k]\,Y[n].$$

The averaging on the right-hand side is often done using xcorr in MATLAB or np.correlate
in Python. A graphical illustration of the input and the autocorrelation function is shown
in Figure 10.25.

After we have found $R_Y[k]$, we need to construct the Yule-Walker equation. For this
linear prediction problem, the left-hand side of the Yule-Walker equation is the vector r,
defined according to Equation (10.44). The Yule-Walker equation also requires the matrix R.
This R can be constructed via the Toeplitz matrix as

$$R = \text{Toeplitz}\left(R_Y[0], R_Y[1], \ldots, R_Y[K-1]\right).$$

In MATLAB, we can call toeplitz to construct the matrix. In Python, the command is
lin.toeplitz, where lin refers to scipy.linalg.

To solve the Yule-Walker equation, we need to invert the matrix R. There are built-in
commands for such an operation. In MATLAB, the command is \ (the backslash), whereas
in Python the command is np.linalg.lstsq.
(a) Y[n]    (b) $R_Y[k]$


Figure 10.25: An example time-series and its autocorrelation function. 


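% MATLAB code to solve the Yule Walker Equation
y = load('data_ch10.txt');
K = 10;                                % number of filter coefficients
N = 320;                               % number of samples

y_corr = xcorr(y);                     % lags -(N-1)..(N-1), lag 0 at index N
R      = toeplitz(y_corr(N+[0:K-1]));  % R_Y[0], ..., R_Y[K-1]
lhs    = y_corr(N+[1:K]);              % R_Y[1], ..., R_Y[K]
h      = R\lhs;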


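# Python code to solve the Yule Walker Equation
import numpy as np
import scipy.linalg as lin

y = np.loadtxt('./data_ch10.txt')
K = 10                                  # number of filter coefficients
N = 320                                 # number of samples

y_corr = np.correlate(y,y,mode='full')  # lag 0 at index N-1
R      = lin.toeplitz(y_corr[N-1:N+K-1]) #call scipy.linalg
lhs    = y_corr[N:N+K]                  # R_Y[1], ..., R_Y[K]
h      = np.linalg.lstsq(R,lhs,rcond=None)[0]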


Note that in both the MATLAB and Python codes the Toeplitz matrix R starts with 
the index N. This is because, as you can see from Figure 10.25, the origin of the autocor- 
relation function is the middle index of the computed autocorrelation function. For r, the 
starting index is N + 1 because the vector starts with Ry [1]. 
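Since the data file used in the text is not reproduced here, the same steps can be tried on a simulated autoregressive process whose coefficients are known. The sketch below is an assumption-laden illustration: the AR(2) coefficients `a1` and `a2`, the noise level and the sample size are hypothetical, and `np.linalg.solve` is used in place of a least-squares solve because the small Toeplitz matrix is well conditioned.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(4)
N, K = 20_000, 2
a1, a2 = 0.6, -0.3                      # hypothetical AR(2) coefficients (stable)

# Simulate Y[n] = a1*Y[n-1] + a2*Y[n-2] + E[n]
y = np.zeros(N)
e = 0.1 * rng.standard_normal(N)
for n in range(2, N):
    y[n] = a1 * y[n - 1] + a2 * y[n - 2] + e[n]

y_corr = np.correlate(y, y, mode="full")    # lags -(N-1)..(N-1); lag 0 at index N-1
R = toeplitz(y_corr[N - 1 : N + K - 1])     # R_Y[0], ..., R_Y[K-1]
lhs = y_corr[N : N + K]                     # R_Y[1], ..., R_Y[K]
h = np.linalg.solve(R, lhs)                 # h[0] weights Y[n-1], h[1] weights Y[n-2]

# One-step prediction from the K most recent samples, newest first
next_pred = np.dot(h, y[-1:-K - 1:-1])
print(h, next_pred)
```

The first entry of the solution is the weight of the most recent sample, so the recent samples have to be ordered newest first when forming the prediction.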

To predict the future samples, we recall the autoregressive model for this problem:

$$\widehat{Y}[n] = \sum_{k=1}^{K} h[k]\,Y[n-k].$$

Therefore, given Y[n − 1], Y[n − 2], ..., Y[n − K], we can predict Ŷ[n]. Then we insert this
predicted Ŷ[n] into the sequence and increment the estimation problem to the next time
index. By repeating the process, we will be able to predict the future samples of Y[n].
Figure 10.26 illustrates the prediction results of the Yule-Walker equation. As you can
see, the predictions are reasonably meaningful since the patterns follow the trend.


[Plot legend: Prediction, Input]


Figure 10.26: An example of the predictions made by the autoregressive model. 


The MATLAB and Python codes are shown below. 


% MATLAB code to predict the samples
z    = y(311:320);                 % the K = 10 most recent samples, oldest first
yhat = zeros(340,1);
yhat(1:320) = y;

for t = 1:20
    predict     = flipud(z)'*h;    % h(1) weights the most recent sample
    z           = [z(2:10); predict];
    yhat(320+t) = predict;
end

plot(yhat, 'r', 'LineWidth', 3); hold on;
plot(y, 'k', 'LineWidth', 4);


# Python code to predict the samples
import matplotlib.pyplot as plt

z = y[310:320]                     # the K = 10 most recent samples, oldest first
yhat = np.zeros((340,1))
yhat[0:320,0] = y

for t in range(20):
    predict = np.dot(z[::-1], h)   # h[0] weights the most recent sample
    z = np.append(z[1:10], predict)
    yhat[320+t,0] = predict

plt.plot(yhat,'r')
plt.plot(y,'k')
plt.show()
10.6.5 Wiener filter


In the previous formulation, we notice that the impulse response has a finite length. There 
are, however, problems in which the impulse response is infinite. For example, a recur- 
sive filter h[n] will be infinitely long. The extension from finite length to infinite length is 
straightforward. We can model the problem as 


$$Y[n] = \sum_{k=-\infty}^{\infty} h[k]\,X[n-k] + E[n].$$

However, when h[n] is infinitely long the Yule-Walker equation does not hold because the
matrix R will be infinitely large. Nevertheless, the building block equation for Yule-Walker
is still valid:

$$R_{YX}[i] = \sum_{k=-\infty}^{\infty} h[k]\,R_X[i-k]. \tag{10.45}$$

To maintain the spirit of the Yule-Walker equation while enabling computation, we
recognize that the infinite sum on the right-hand side is, in fact, a convolution. Thus we
can take the (discrete-time) Fourier transform of both sides to obtain

$$S_{YX}(e^{j\omega}) = H(e^{j\omega})\,S_X(e^{j\omega}). \tag{10.46}$$

Therefore, the corresponding optimal linear filter (in the Fourier domain) is

$$H(e^{j\omega}) = \frac{S_{YX}(e^{j\omega})}{S_X(e^{j\omega})}, \tag{10.47}$$

and

$$h[n] = \mathcal{F}^{-1}\left\{\frac{S_{YX}(e^{j\omega})}{S_X(e^{j\omega})}\right\}.$$

The filter obtained in this way is known as the Wiener filter.


Example 10.21. (Denoising) Suppose X[n] = Y[n] + W[n], where W[n] is the noise
term that is independent of Y[n], as shown in Figure 10.27.

[Block diagram: Target Function Y[n] plus Noise W[n] gives Input Function X[n] → Optimal Filter → Predicted Function Ŷ[n]]


Figure 10.27: Design of a Wiener filter that takes an input function X[n] and outputs an estimate
Ŷ[n] that is close to the true function Y[n].


Now, given the input function X[n], can we construct the Wiener filter h[n] such
that the predicted function Ŷ[n] is as close to Y[n] as possible? The Wiener filter for
this problem is also the optimal denoising filter.
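Solution. The following correlation functions can easily be seen:

$$\begin{aligned}
R_X[k] &= \mathbb{E}[X[n+k]X[n]] \\
&= \mathbb{E}[(Y[n+k] + W[n+k])(Y[n] + W[n])] \\
&= \mathbb{E}[Y[n+k]Y[n]] + \mathbb{E}[Y[n+k]W[n]] + \mathbb{E}[W[n+k]Y[n]] + \mathbb{E}[W[n+k]W[n]] \\
&= \mathbb{E}[Y[n+k]Y[n]] + 0 + 0 + \mathbb{E}[W[n+k]W[n]] \\
&= R_Y[k] + R_W[k].
\end{aligned}$$

Similarly, we have

$$\begin{aligned}
R_{YX}[k] &= \mathbb{E}[Y[n+k]X[n]] \\
&= \mathbb{E}[Y[n+k](Y[n] + W[n])] = \mathbb{E}[Y[n+k]Y[n]] = R_Y[k].
\end{aligned}$$

Consequently, the optimal linear filter is

$$H(e^{j\omega}) = \frac{S_{YX}(e^{j\omega})}{S_X(e^{j\omega})} = \frac{S_Y(e^{j\omega})}{S_Y(e^{j\omega}) + S_W(e^{j\omega})}.$$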


What is the Wiener filter for a denoising problem?


• Suppose the corrupted function X[n] is related to the clean function Y[n] through
X[n] = Y[n] + W[n], for some noise function W[n].


• The Wiener filter is

$$H(e^{j\omega}) = \frac{S_Y(e^{j\omega})}{S_Y(e^{j\omega}) + S_W(e^{j\omega})}. \tag{10.48}$$


• To perform the filtering, the denoised function Ŷ[n] is

$$\widehat{Y}[n] = \mathcal{F}^{-1}\left\{H(e^{j\omega})\,X(e^{j\omega})\right\}.$$


Figure 10.28 shows an example of applying the Wiener filter to a noise removal problem.
In this example we let W[n] be an i.i.d. Gaussian process with standard deviation
$\sigma = 0.05$ and mean $\mu = 0$. The noisy samples of random process X[n] are defined as
X[n] = Y[n] + W[n], where Y[n] is the clean function. As you can see from Figure 10.28(a),
the Wiener filter is able to denoise the function reasonably well.

The optimal linear filter used for this denoising task is infinitely long. This can be seen 
in Figure 10.28(b), where the filter length is the same as the length of the observed time 
series X[n]. If X[n] is longer, the filter h[n] will also become longer. Therefore, finite-length 
approaches such as the Yule-Walker equation do not apply here. 
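The denoising filter of Equation (10.48) can be tried on a synthetic signal. The sketch below is only an illustration: the clean signal is a hypothetical sinusoid, and the spectra are computed from the clean signal and the noise themselves (an "oracle" setting that is not available in practice), in the same spirit as estimating them from temporal correlations.

```python
import numpy as np
from scipy.fft import fft, ifft

rng = np.random.default_rng(5)
N = 320
n = np.arange(N)
y = 0.2 * np.sin(2 * np.pi * n / 80)    # hypothetical clean signal
w = 0.05 * rng.standard_normal(N)       # i.i.d. Gaussian noise, sigma = 0.05
x = y + w

# Oracle spectra from the autocorrelations of the clean signal and the noise
L = 2 * N - 1                           # length of the 'full' autocorrelation
Sy = fft(np.correlate(y, y, mode="full"))
Sw = fft(np.correlate(w, w, mode="full"))
H = Sy / (Sy + Sw)                      # Wiener filter for denoising

# Zero-pad x to length L, filter in the Fourier domain, keep the first N samples
yhat = np.real(ifft(H * fft(x, L)))[:N]

mse_noisy = np.mean((x - y) ** 2)
mse_filtered = np.mean((yhat - y) ** 2)
print(mse_noisy, mse_filtered)
```

The filter passes the frequencies where the signal dominates and suppresses the rest, so the mean squared error of the filtered signal should be well below that of the noisy input.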


(a) Noise removal by Wiener filtering [legend: Noisy Input X[n], Wiener Filtered Yhat[n], Ground Truth Y[n]]    (b) Wiener filter


Figure 10.28: (a) Applying a Wiener filter to denoise a function. (b) The Wiener filter used for the 
denoising task. 


The MATLAB / Python codes used to generate Figure 10.28(a) are shown below.
The main commands here are fft and ifft, which in Python are available in the scipy.fft
module. The command Yhat = H.*fft(x, 639) in MATLAB executes the Wiener filtering
step. Here, we zero-pad the function x to 639 samples so that its length matches that of the
Wiener filter H. The similar command in Python is H * fft(x, 639).


% MATLAB code for Wiener filtering
w = 0.05*randn(320,1);
x = y + w;

Ry = xcorr(y);
Rw = xcorr(w);
Sy = fft(Ry);
Sw = fft(Rw);
H  = Sy./(Sy + Sw);
Yhat = H.*fft(x, 639);
yhat = real(ifft(Yhat));

plot(x, 'LineWidth', 4, 'Color', [0.7, 0.7, 0.7]); hold on;
plot(yhat(1:320), 'r', 'LineWidth', 2);
plot(y, 'k:', 'LineWidth', 2);


# Python code for Wiener filtering
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft, ifft

w = 0.05*np.random.randn(320)
x = y + w

Ry = np.correlate(y,y,mode='full')
Rw = np.correlate(w,w,mode='full')
Sy = fft(Ry)
Sw = fft(Rw)
H  = Sy / (Sy+Sw)
Yhat = H * fft(x, 639)
yhat = np.real(ifft(Yhat))

plt.plot(x,color='gray')
plt.plot(yhat[0:320],'r')
plt.plot(y,'k:')


Example 10.22. (Deconvolution) Suppose that the corrupted function is generated
according to a linear process given by

$$X[n] = \sum_{\ell=-\infty}^{\infty} g[\ell]\,Y[n-\ell] + W[n],$$

where g[n] is the impulse response of some kind of degradation process and W[n] is
the Gaussian noise term, as shown in Figure 10.29. Find the optimal linear filter (i.e.,
the Wiener filter) to estimate Y[n].


[Block diagram: Target Function Y[n] → Degradation g[n], plus Noise W[n], gives Input Function X[n] → Optimal Filter → Predicted Function Ŷ[n]]


Figure 10.29: Design of a Wiener filter that takes an input function X[n] and outputs an estimate
Ŷ[n] that is close to the true function Y[n].


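Solution. To construct the Wiener filter, we first determine the cross-correlation function:

$$R_{YX}[k] = \mathbb{E}[Y[n+k]X[n]] = \mathbb{E}\left[Y[n+k]\left(\sum_{\ell=-\infty}^{\infty} g[\ell]\,Y[n-\ell] + W[n]\right)\right].$$

Using algebra, it follows that

$$\begin{aligned}
\mathbb{E}\left[Y[n+k]\left(\sum_{\ell=-\infty}^{\infty} g[\ell]\,Y[n-\ell] + W[n]\right)\right]
&= \sum_{\ell=-\infty}^{\infty} g[\ell]\,\mathbb{E}[Y[n+k]Y[n-\ell]] + \mathbb{E}[Y[n+k]W[n]] \\
&= \sum_{\ell=-\infty}^{\infty} g[\ell]\,R_Y[k+\ell] = (g \circledast R_Y)[k],
\end{aligned}$$

which is the correlation between g and $R_Y$. Therefore, the cross power spectral density
$S_{YX}(e^{j\omega})$ is

$$S_{YX}(e^{j\omega}) = \overline{G(e^{j\omega})}\,S_Y(e^{j\omega}).$$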
The autocorrelation of this problem is

$$\begin{aligned}
R_X[k] &= \mathbb{E}[X[n+k]X[n]] \\
&= \mathbb{E}[((g*Y)[n+k] + W[n+k])((g*Y)[n] + W[n])] \\
&= \mathbb{E}[(g*Y)[n+k]\,(g*Y)[n]] + \mathbb{E}[W[n+k]W[n]] \\
&= (g \circledast (g * R_Y))[k] + R_W[k],
\end{aligned}$$

where, according to the previous section, the first part is the correlation $\circledast$ followed by
a convolution $*$. Therefore, the power spectral density of X is

$$S_X(e^{j\omega}) = |G(e^{j\omega})|^2 S_Y(e^{j\omega}) + S_W(e^{j\omega}).$$

Combining the results, the Wiener filter is

$$H(e^{j\omega}) = \frac{S_{YX}(e^{j\omega})}{S_X(e^{j\omega})} = \frac{\overline{G(e^{j\omega})}\,S_Y(e^{j\omega})}{|G(e^{j\omega})|^2 S_Y(e^{j\omega}) + S_W(e^{j\omega})}.$$


What is the Wiener filter for a deconvolution problem?


• Suppose that the corrupted function X[n] is related to the clean function Y[n]
through X[n] = (g * Y)[n] + W[n], for some degradation g[n] and noise W[n].


• The Wiener filter is

$$H(e^{j\omega}) = \frac{\overline{G(e^{j\omega})}\,S_Y(e^{j\omega})}{|G(e^{j\omega})|^2 S_Y(e^{j\omega}) + S_W(e^{j\omega})}. \tag{10.49}$$


• To perform the filtering, the estimated function Ŷ[n] is

$$\widehat{Y}[n] = \mathcal{F}^{-1}\left\{H(e^{j\omega})\,X(e^{j\omega})\right\}.$$


As an example of the deconvolution problem, we show a WSS function Y[n] in Fig- 
ure 10.30. This clean function Y[n] is constructed by passing an i.i.d. noise process through 
an arbitrary LTI system so that the WSS property is guaranteed. Given this Y[n], we con- 
struct a degradation process in which the impulse response is given by g[n]. In this example, 
we assume that g[n] is a uniform function. We then add noise W[n] to the time series to 
obtain the corrupted observation X[n]. The reconstruction by the Wiener filter is shown in 
Figure 10.30. 

[Plot legend: Noisy Input X[n], Wiener Filtered Yhat[n], Ground Truth Y[n]]


Figure 10.30: Reconstructing time series from degraded observations using a Wiener filter.


The MATLAB and Python codes used to generate Figure 10.30 are shown below.


% MATLAB code to solve the Wiener deconvolution problem
load('ch10_wiener_deblur_data');
g = ones(32,1)/32;
w = 0.02*randn(320,1);
x = conv(y,g,'same') + w;

Ry = xcorr(y);
Rw = xcorr(w);
Sy = fft(Ry);
Sw = fft(Rw);
G  = fft(g,639);

H = (conj(G).*Sy)./(abs(G).^2.*Sy + Sw);
Yhat = H.*fft(x, 639);
yhat = real(ifft(Yhat));

figure;
plot(x, 'LineWidth', 4, 'Color', [0.5, 0.5, 0.5]); hold on;
plot(16:320+15, yhat(1:320), 'r', 'LineWidth', 2);
plot(1:320, y, 'k:', 'LineWidth', 2);


# Python code to solve the Wiener deconvolution problem
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft, ifft

y = np.loadtxt('./ch10_wiener_deblur_data.txt')
g = np.ones(64)/64
w = 0.02*np.random.randn(320)
x = np.convolve(y,g,mode='same') + w

Ry = np.correlate(y,y,mode='full')
Rw = np.correlate(w,w,mode='full')
Sy = fft(Ry)
Sw = fft(Rw)
G  = fft(g,639)
H  = (np.conj(G)*Sy)/( np.power(np.abs(G),2)*Sy + Sw )

Yhat = H * fft(x, 639)
yhat = np.real(ifft(Yhat))

plt.plot(x,color='gray')
plt.plot(np.arange(32,320+32),yhat[0:320],'r')
plt.plot(y,'k:')


Caveat to Wiener filtering. In practice, the above Wiener filter needs to be modified because $S_Y(e^{j\omega})$ and $S_W(e^{j\omega})$ cannot be estimated from the data via the temporal correlation (as we did in the MATLAB/Python programs). The reason is that we never have access to Y[n] and W[n]. In this case, one has to guess the power spectral densities $S_Y(e^{j\omega})$ and $S_W(e^{j\omega})$. The noise power $S_W(e^{j\omega})$ is usually not difficult to estimate. For example, in the program we showed above, the noise power spectral density is Sw = 0.02^2*320 (MATLAB), which is the noise variance times the number of samples.

The signal $S_Y(e^{j\omega})$ is often the hard part. In the absence of any knowledge about the ground truth's power spectral density, the Wiener filter does not work. However, for certain problems in which $S_Y(e^{j\omega})$ can be predetermined by prior knowledge, the Wiener filter is guaranteed to be optimal, that is, optimal in the mean-squared-error sense over the entire time axis.
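The remark about the noise level can be checked numerically. The following is a minimal sketch, not part of the programs above: it assumes white Gaussian noise with standard deviation 0.02 and length 320, and compares the averaged squared FFT magnitude with the variance times the number of samples. The seed, the number of trials, and the variable names are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
sigma, N, trials = 0.02, 320, 2000

# Each row is one realization of white Gaussian noise W[n] of length N.
w = sigma * rng.standard_normal((trials, N))

# Squared FFT magnitude, averaged over realizations and over frequency bins.
Sw_est = np.mean(np.abs(np.fft.fft(w, axis=1)) ** 2)
Sw_theory = sigma**2 * N         # the value 0.02^2 * 320 quoted in the text

print(Sw_est, Sw_theory)
```

The two printed numbers should be close, which is why a flat guess for the noise spectrum is usually adequate when the noise is white.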

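Wiener filter versus ridge regression. The Wiener filter equation can be interpreted as a ridge regression. Denoting the forward observation model by

$$\mathbf{x} = \mathbf{G}\mathbf{y} + \mathbf{w},$$

the corresponding ridge regression minimization is

$$\widehat{\mathbf{y}} = \underset{\mathbf{y}}{\text{argmin}} \; \|\mathbf{x} - \mathbf{G}\mathbf{y}\|^2 + \lambda\|\mathbf{y}\|^2 = (\mathbf{G}^T\mathbf{G} + \lambda\mathbf{I})^{-1}\mathbf{G}^T\mathbf{x}.$$

If $\mathbf{G}$ is a convolutional matrix, the above solution can be written in the Fourier domain (by using the Fourier transform as the eigenvectors):

$$H(e^{j\omega}) = \frac{\overline{G(e^{j\omega})}}{|G(e^{j\omega})|^2 + \lambda}.$$

Comparing this "optimal linear filter" with the Wiener filter, we observe that the Wiener filter has slightly more generality:

$$H(e^{j\omega}) = \frac{\overline{G(e^{j\omega})}\, S_Y(e^{j\omega})}{|G(e^{j\omega})|^2 S_Y(e^{j\omega}) + S_W(e^{j\omega})} = \frac{\overline{G(e^{j\omega})}}{|G(e^{j\omega})|^2 + S_W(e^{j\omega})/S_Y(e^{j\omega})}.$$

Therefore, in the absence of $S_Y(e^{j\omega})$ (so that it is treated as a constant) and assuming that $S_W(e^{j\omega})$ is a constant (e.g., for white Gaussian noise), the ratio $S_W/S_Y$ plays the role of $\lambda$, and the Wiener filter is exactly a ridge regression.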
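The equivalence between the matrix form and the Fourier form of the ridge solution can be verified on a small example. The sketch below assumes a circular (periodic) convolution so that the convolution matrix is exactly circulant; the kernel, the signal length, the noise level, and the value of lambda are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 64, 0.1                      # hypothetical length and ridge parameter

g = np.zeros(N)
g[:5] = 1 / 5                         # uniform blur kernel, zero-padded to length N

# Circulant convolution matrix: column k is g circularly shifted by k samples,
# so that (G @ y)[n] = sum_k g[(n - k) mod N] * y[k].
G = np.column_stack([np.roll(g, k) for k in range(N)])

y = rng.standard_normal(N)                      # unknown signal
x = G @ y + 0.05 * rng.standard_normal(N)       # observation x = G y + w

# Ridge regression in matrix form: (G^T G + lam I)^{-1} G^T x.
y_ridge = np.linalg.solve(G.T @ G + lam * np.eye(N), G.T @ x)

# The same estimate computed in the Fourier domain.
Gf = np.fft.fft(g)
H = np.conj(Gf) / (np.abs(Gf) ** 2 + lam)
y_fourier = np.real(np.fft.ifft(H * np.fft.fft(x)))

print(np.max(np.abs(y_ridge - y_fourier)))      # difference at rounding-error level
```

The Fourier form avoids building and inverting the N-by-N matrix, which is the practical reason for writing the filter in the frequency domain.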
10.7 Summary


Random processes are very useful tools for analyzing random variables over time. In this 
chapter, we have introduced some of the most basic mechanisms: 


• Statistical versus temporal analysis: The statistical analysis of a random process looks at the random process vertically. It treats X(t) as a random variable and studies the randomness across different realizations. The temporal analysis is the horizontal perspective. It treats X(t) as a function in time with a fixed random index. In general, statistical average ≠ temporal average.

• Mean function $\mu_X(t)$: The mean function is the expectation of the random process. At every time t, we take the expectation to obtain the expected value $\mathbb{E}[X(t)]$.

• Autocorrelation function $R_X(t_1, t_2)$: This is the joint expectation of the random process at two different time instants $t_1$ and $t_2$. The corresponding values $X(t_1)$ and $X(t_2)$ are two random variables, and so the joint expectation measures how correlated these two variables are.

• Wide-sense stationary (WSS): This is a special class of random processes in which $\mu_X(t)$ is a constant and $R_X(t_1, t_2)$ is a function of $t_1 - t_2$. When this happens, the autocorrelation function (which is originally a 2D function) will have a Toeplitz structure. We write $R_X(t_1, t_2)$ as $R_X(\tau)$, where $\tau = t_1 - t_2$.

• Power spectral density (PSD): This is the Fourier transform of the autocorrelation function $R_X(\tau)$, according to the Einstein-Wiener-Khinchin theorem. It is called the power spectral density because we can integrate it in the Fourier space to retrieve the power. This provides us with some convenient computational tools for analyzing data.

• Random process through a linear time-invariant (LTI) system: This tells us how a random process behaves after going through an LTI system. The analysis can be done at the realization level, where we look at each random process, or at the statistical level, where we look at the autocorrelation function and the PSD.

• Optimal linear filter: A set of techniques that can be used to retrieve signals by using the statistical information of the data and the system. We introduced two specific approaches: the Yule-Walker equation for a finite-length filter and the Wiener filter for an infinite-length filter. We demonstrated how these techniques could be applied to forecast a time series and recover a time series from corrupted measurements.
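Several of the points above (temporal averages, the autocorrelation of a WSS process, and the effect of an LTI system) can be seen together in one small experiment. The sketch below is an illustration under assumed settings: unit-variance white Gaussian noise is passed through a hypothetical two-tap filter h = [1, 0.5], and the temporal autocorrelation of the output is compared with the value predicted from the filter taps.

```python
import random

random.seed(0)                        # fixed seed for reproducibility
N = 200_000                           # number of output samples (illustrative)

# Unit-variance white Gaussian input.
x = [random.gauss(0.0, 1.0) for _ in range(N + 1)]

# Output of the FIR filter h = [1, 0.5]; the output is a WSS process.
y = [x[n + 1] + 0.5 * x[n] for n in range(N)]

def temporal_autocorr(z, k):
    """Temporal average of z[n + k] * z[n] over all available n."""
    n = len(z) - k
    return sum(z[i + k] * z[i] for i in range(n)) / n

R_hat = [temporal_autocorr(y, k) for k in range(3)]

# For white unit-variance input, R_Y[k] = sum_i h[i] h[i + k].
R_true = [1.0 * 1.0 + 0.5 * 0.5, 1.0 * 0.5, 0.0]

print(R_hat)
print(R_true)
```

Because the process is WSS, a single long realization is enough for the temporal estimate to approach the statistical autocorrelation in this example.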


While we have covered some of the most basic ideas in random processes, there are 
also several topics we have not discussed. These include, but are not limited to: strictly 
stationary process, a more restrictive class of random process than WSS; Poisson process, 
a useful model for arrival analysis; Markov chain, a discrete-time random process where 
the current state only depends on the previous state. Readers interested in these materials 
should consult the references listed at the end of this chapter. 
10.8 Appendix


The Einstein-Wiener-Khinchin theorem 


The Einstein-Wiener-Khinchin theorem is a fundamental result. It states that for any wide- 
sense stationary process, the power spectral density Sx(w) is the Fourier transform of the 
autocorrelation function. 


Theorem 10.10 (The Einstein-Wiener-Khinchin theorem). For a WSS random process X(t),

$$S_X(\omega) = \mathcal{F}\{R_X(\tau)\}, \tag{10.50}$$

whenever the Fourier transform of $R_X(\tau)$ exists.


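Proof. First, let's recall the definition of $S_X(\omega)$:

$$S_X(\omega) = \lim_{T \to \infty} \frac{1}{2T}\, \mathbb{E}\left[|\widetilde{X}_T(\omega)|^2\right]. \tag{10.51}$$

By expanding the expectation, we have

$$\begin{aligned}
\mathbb{E}\left[|\widetilde{X}_T(\omega)|^2\right] &= \mathbb{E}\left[\left(\int_{-T}^{T} X(t) e^{-j\omega t}\, dt\right)\left(\int_{-T}^{T} X(\theta) e^{-j\omega\theta}\, d\theta\right)^{*}\right] \\
&= \int_{-T}^{T}\int_{-T}^{T} \mathbb{E}[X(t)X(\theta)]\, e^{-j\omega(t-\theta)}\, dt\, d\theta = \int_{-T}^{T}\int_{-T}^{T} R_X(t-\theta)\, e^{-j\omega(t-\theta)}\, dt\, d\theta.
\end{aligned} \tag{10.52}$$

Our next step is to analyze $R_X(t - \theta)$. Define

$$Q_X(v) = \mathcal{F}\{R_X(\tau)\}. \tag{10.53}$$

Then, by the inverse Fourier transform,

$$R_X(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v) e^{jv\tau}\, dv,$$

and therefore

$$R_X(t-\theta) = \frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v) e^{jv(t-\theta)}\, dv.$$

Substituting this into Equation (10.52) yields

$$\begin{aligned}
\mathbb{E}\left[|\widetilde{X}_T(\omega)|^2\right] &= \int_{-T}^{T}\int_{-T}^{T} \left(\frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v) e^{jv(t-\theta)}\, dv\right) e^{-j\omega(t-\theta)}\, dt\, d\theta \\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v) \left[\int_{-T}^{T} e^{jt(v-\omega)}\, dt\right]\left[\int_{-T}^{T} e^{-j\theta(v-\omega)}\, d\theta\right] dv.
\end{aligned}$$

We now need to simplify the two inner integrals. Recall by Fourier pair that

$$\text{rect}\left(\frac{t}{T}\right) \overset{\mathcal{F}}{\longleftrightarrow} T\,\text{sinc}\left(\frac{\omega T}{2}\right).$$

This implies that

$$\begin{aligned}
\int_{-T}^{T} e^{jt(v-\omega)}\, dt &= \int_{-T}^{T} e^{-j(\omega - v)t}\, dt \\
&= \int_{-\infty}^{\infty} \text{rect}\left(\frac{t}{2T}\right) e^{-j(\omega-v)t}\, dt = 2T\,\text{sinc}((\omega - v)T) = 2T\,\frac{\sin((\omega-v)T)}{(\omega-v)T}.
\end{aligned}$$

The second inner integral has the same value because the sinc function is even. Hence, we have

$$\mathbb{E}\left[|\widetilde{X}_T(\omega)|^2\right] = \frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v) \left(2T\,\frac{\sin((\omega-v)T)}{(\omega-v)T}\right)^2 dv, \tag{10.54}$$

and so

$$\frac{1}{2T}\,\mathbb{E}\left[|\widetilde{X}_T(\omega)|^2\right] = \frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v)\, 2T\left(\frac{\sin((\omega-v)T)}{(\omega-v)T}\right)^2 dv. \tag{10.55}$$

As $T \to \infty$ (see Lemma 10.5 below), we have

$$2T\left(\frac{\sin((\omega-v)T)}{(\omega-v)T}\right)^2 \longrightarrow 2\pi\,\delta(\omega - v).$$

Therefore,

$$\begin{aligned}
\lim_{T\to\infty} \frac{1}{2T}\,\mathbb{E}\left[|\widetilde{X}_T(\omega)|^2\right] &= \frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v) \lim_{T\to\infty} 2T\left(\frac{\sin((\omega-v)T)}{(\omega-v)T}\right)^2 dv \\
&= \int_{-\infty}^{\infty} Q_X(v)\,\delta(\omega - v)\, dv = Q_X(\omega).
\end{aligned}$$

Since $Q_X(\omega) = \mathcal{F}[R_X(\tau)]$, we conclude that

$$S_X(\omega) = \lim_{T\to\infty}\frac{1}{2T}\,\mathbb{E}\left[|\widetilde{X}_T(\omega)|^2\right] = Q_X(\omega) = \mathcal{F}[R_X(\tau)].$$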
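A discrete-time analogue of this result can be checked numerically. The sketch below is an illustration under assumed settings, not the continuous-time statement itself: unit-variance white noise is passed through a hypothetical filter h = [1, 0.5], whose output has autocorrelation 1.25 at lag 0 and 0.5 at lags of plus or minus 1. The averaged periodogram (the finite-window quantity on the left) is compared with the Fourier transform of the autocorrelation (the quantity on the right).

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 256, 4000                 # window length and number of realizations

# White noise (unit variance) through the FIR filter h = [1, 0.5].
x = rng.standard_normal((trials, N + 1))
y = x[:, 1:] + 0.5 * x[:, :-1]

# Left-hand side: (1/N) E|Y_N(w)|^2, estimated by averaging periodograms.
periodogram = np.mean(np.abs(np.fft.fft(y, axis=1)) ** 2, axis=0) / N

# Right-hand side: sum_k R_Y[k] e^{-jwk} with R_Y = {0.5, 1.25, 0.5} at k = -1, 0, 1.
omega = 2 * np.pi * np.arange(N) / N
psd = 1.25 + np.cos(omega)

max_err = np.max(np.abs(periodogram - psd))
print(max_err)
```

The maximum deviation shrinks as the number of realizations grows, in line with the expectation in Equation (10.51).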


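Lemma 10.5.

$$\lim_{T\to\infty}\frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v)\, 2T\left(\frac{\sin((\omega-v)T)}{(\omega-v)T}\right)^2 dv = Q_X(\omega). \tag{10.56}$$

To prove this lemma, we first define $\delta_T(\omega) = 2T\left(\frac{\sin(\omega T)}{\omega T}\right)^2$. It is sufficient to show that

$$\left|\frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v)\,\delta_T(\omega - v)\, dv - Q_X(\omega)\right| \to 0 \quad \text{as } T \to \infty. \tag{10.57}$$

We will proceed by demonstrating the following three facts about $\delta_T(\omega)$:

1. $\frac{1}{2\pi}\int_{-\infty}^{\infty} \delta_T(\omega)\, d\omega = 1$.

2. For any $\Delta > 0$,
$$\int_{\{\omega : |\omega| > \Delta\}} \delta_T(\omega)\, d\omega \to 0 \quad \text{as } T \to \infty.$$

3. For any $|\omega| \ge \Delta > 0$, we have $|\delta_T(\omega)| \le \frac{2}{T\Delta^2}$.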


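Proof of Fact 1. Write

$$\delta_T(\omega) = 2T\left(\frac{\sin(\omega T)}{\omega T}\right)^2 = 2T\,\text{sinc}^2(\omega T).$$

Note that, with the triangle function $\Lambda(t)$ defined in the Fourier transform table,

$$\Lambda\left(\frac{t}{4T}\right) \overset{\mathcal{F}}{\longleftrightarrow} 2T\,\text{sinc}^2(\omega T).$$

Therefore,

$$\frac{1}{2\pi}\int_{-\infty}^{\infty} 2T\,\text{sinc}^2(\omega T)\, d\omega = \frac{1}{2\pi}\int_{-\infty}^{\infty} 2T\,\text{sinc}^2(\omega T)\, e^{j\omega \cdot 0}\, d\omega = \Lambda(0) = 1.$$

Proof of Fact 2.

$$\begin{aligned}
\int_{\Delta}^{\infty} \delta_T(\omega)\, d\omega &= \int_{\Delta}^{\infty} 2T\left(\frac{\sin(\omega T)}{\omega T}\right)^2 d\omega = \frac{2T}{T^2}\int_{\Delta}^{\infty} \frac{\sin^2(\omega T)}{\omega^2}\, d\omega \\
&\le \frac{2}{T}\int_{\Delta}^{\infty} \frac{1}{\omega^2}\, d\omega \qquad (\text{since } |\sin(\cdot)|^2 \le 1) \\
&= \frac{2}{T}\left[-\frac{1}{\omega}\right]_{\Delta}^{\infty} = \frac{2}{T\Delta} \to 0 \quad \text{as } T \to \infty.
\end{aligned}$$

Since $\delta_T(\omega)$ is an even function, the same bound holds for the integral over $\omega < -\Delta$.

Proof of Fact 3.

$$\delta_T(\omega) = 2T\left(\frac{\sin(\omega T)}{\omega T}\right)^2 \le 2T\left(\frac{1}{\omega T}\right)^2 = \frac{2}{T\omega^2} \le \frac{2}{T\Delta^2}.$$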


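Proof of Lemma. Consider $Q_X(\omega)$. By Property 1,

$$Q_X(\omega) = Q_X(\omega)\cdot\frac{1}{2\pi}\int_{-\infty}^{\infty} \delta_T(\omega - v)\, dv = \frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(\omega)\,\delta_T(\omega - v)\, dv.$$

Therefore,

$$\begin{aligned}
\left|\frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v)\,\delta_T(\omega-v)\, dv - Q_X(\omega)\right| &= \left|\frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v)\,\delta_T(\omega-v)\, dv - \frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(\omega)\,\delta_T(\omega-v)\, dv\right| \\
&= \frac{1}{2\pi}\left|\int_{-\infty}^{\infty} \left(Q_X(v) - Q_X(\omega)\right)\delta_T(\omega-v)\, dv\right| \\
&\le \frac{1}{2\pi}\int_{-\infty}^{\infty} |Q_X(v) - Q_X(\omega)|\,\delta_T(\omega-v)\, dv.
\end{aligned}$$

For any $\epsilon > 0$, let $\Delta$ be a constant such that

$$|Q_X(v) - Q_X(\omega)| < \epsilon \quad \text{whenever} \quad |\omega - v| < \Delta.$$

Then we can partition the above integral into

$$\begin{aligned}
\frac{1}{2\pi}\int_{-\infty}^{\infty} |Q_X(\omega) - Q_X(v)|\,\delta_T(\omega-v)\, dv &= \frac{1}{2\pi}\int_{\omega-\Delta}^{\omega+\Delta} |Q_X(\omega) - Q_X(v)|\,\delta_T(\omega-v)\, dv \qquad (1) \\
&\quad + \frac{1}{2\pi}\int_{\omega+\Delta}^{\infty} |Q_X(\omega) - Q_X(v)|\,\delta_T(\omega-v)\, dv \qquad (2) \\
&\quad + \frac{1}{2\pi}\int_{-\infty}^{\omega-\Delta} |Q_X(\omega) - Q_X(v)|\,\delta_T(\omega-v)\, dv. \qquad (3)
\end{aligned}$$

Partition (1) above can be evaluated as follows:

$$\begin{aligned}
\frac{1}{2\pi}\int_{\omega-\Delta}^{\omega+\Delta} |Q_X(\omega) - Q_X(v)|\,\delta_T(\omega-v)\, dv &\le \frac{1}{2\pi}\int_{\omega-\Delta}^{\omega+\Delta} \epsilon\,\delta_T(\omega-v)\, dv \\
&= \frac{\epsilon}{2\pi}\int_{\omega-\Delta}^{\omega+\Delta} \delta_T(\omega-v)\, dv \\
&\le \frac{\epsilon}{2\pi}\int_{-\infty}^{\infty} \delta_T(\omega-v)\, dv = \epsilon,
\end{aligned}$$

where the last inequality holds because $\delta_T(\omega - v) \ge 0$. Since $\epsilon$ can be arbitrarily small, the only possibility for

$$\frac{1}{2\pi}\int_{\omega-\Delta}^{\omega+\Delta} |Q_X(\omega) - Q_X(v)|\,\delta_T(\omega-v)\, dv \le \epsilon$$

for all $\epsilon$ is that the integral is 0.

Partition (2) above can be evaluated as follows:

$$\begin{aligned}
\frac{1}{2\pi}\int_{\omega+\Delta}^{\infty} |Q_X(\omega) - Q_X(v)|\,\delta_T(\omega-v)\, dv &\le \frac{1}{2\pi}\int_{\omega+\Delta}^{\infty} \left(|Q_X(\omega)| + |Q_X(v)|\right)\delta_T(\omega-v)\, dv \\
&= \frac{|Q_X(\omega)|}{2\pi}\int_{\omega+\Delta}^{\infty} \delta_T(\omega-v)\, dv + \frac{1}{2\pi}\int_{\omega+\Delta}^{\infty} |Q_X(v)|\,\delta_T(\omega-v)\, dv.
\end{aligned}$$

By Property 2, $\frac{1}{2\pi}\int_{\omega+\Delta}^{\infty} \delta_T(\omega-v)\, dv \to 0$ as $T \to \infty$. By Property 3,

$$\frac{1}{2\pi}\int_{\omega+\Delta}^{\infty} |Q_X(v)|\,\delta_T(\omega-v)\, dv \le \frac{1}{2\pi}\,\frac{2}{T\Delta^2}\int_{\omega+\Delta}^{\infty} |Q_X(v)|\, dv \to 0,$$

where the remaining integral is finite because $Q_X(v) = \mathcal{F}[R_X(\tau)]$. Therefore, we conclude that

$$\frac{1}{2\pi}\int_{\omega+\Delta}^{\infty} |Q_X(v)|\,\delta_T(\omega-v)\, dv \to 0 \quad \text{as } T \to \infty.$$

Partition (3) is handled in the same way as (2), and hence (1), (2) and (3) all $\to 0$ as $T \to \infty$. So we have

$$\left|\frac{1}{2\pi}\int_{-\infty}^{\infty} Q_X(v)\, 2T\left(\frac{\sin((\omega-v)T)}{(\omega-v)T}\right)^2 dv - Q_X(\omega)\right| \to 0 \quad \text{as } T \to \infty,$$

which completes the proof.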
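The behavior of the kernel used in this lemma, namely 2T (sin(wT)/(wT))^2, can be inspected numerically. The sketch below is only an illustration: it approximates the integrals with a midpoint rule on a truncated interval, and the truncation limit, grid sizes, and values of T are choices made here. It checks that the kernel divided by 2*pi has total mass close to one and that the mass away from the origin shrinks as T grows.

```python
import math

def delta_T(w, T):
    """Kernel 2T * (sin(wT)/(wT))^2, with the limiting value 2T at w = 0."""
    if w == 0.0:
        return 2.0 * T
    s = math.sin(w * T) / (w * T)
    return 2.0 * T * s * s

def integrate(f, a, b, n):
    """Midpoint rule with n sub-intervals on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def total_mass(T, L=200.0, n=400_000):
    """(1/2pi) * integral of delta_T over [-L, L]; L truncates the infinite range."""
    return integrate(lambda w: delta_T(w, T), -L, L, n) / (2 * math.pi)

def tail_mass(T, Delta, L=200.0, n=200_000):
    """(1/2pi) * integral of delta_T over Delta < |w| < L, using symmetry."""
    return 2 * integrate(lambda w: delta_T(w, T), Delta, L, n) / (2 * math.pi)

print(total_mass(10.0))               # close to 1
print(tail_mass(5.0, 0.5))            # mass outside |w| > 0.5 for T = 5
print(tail_mass(50.0, 0.5))           # much smaller for T = 50
```

The shrinking tail mass is the numerical counterpart of the second fact, and the bound 2/(T * Delta) on each side can be compared against the printed values.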


10.8.1 The Mean-Square Ergodic Theorem 


The mean-square ergodic theorem states that for any WSS random process, the statistical 
average is the same as the temporal average. This provides an important tool in practice 
because finding the statistical average is typically very difficult. With the mean ergodic 
theorem, one can easily estimate the statistical average using the temporal average. 


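Theorem 10.11 (Mean-Square Ergodic Theorem). Let Y(t) be a WSS process, with mean $\mathbb{E}[Y(t)] = m$ and autocorrelation function $R_Y(\tau)$. Assume that the Fourier transform of $R_Y(\tau)$ exists. Define

$$M_T = \frac{1}{2T}\int_{-T}^{T} Y(t)\, dt. \tag{10.58}$$

Then $\mathbb{E}\left[|M_T - m|^2\right] \to 0$ as $T \to \infty$.

Proof of Mean Ergodic Theorem. Let $X(t) = Y(t) - m$. It follows that

$$M_T - m = \frac{1}{2T}\int_{-T}^{T} Y(t)\, dt - m = \frac{1}{2T}\int_{-T}^{T} X(t)\, dt.$$

We define the finite-window approximation of X(t):

$$X_T(t) = \begin{cases} X(t), & -T \le t \le T, \\ 0, & \text{elsewhere.} \end{cases}$$

Then the difference $M_T - m$ can be computed as

$$M_T - m = \frac{1}{2T}\int_{-T}^{T} X(t)\, dt = \frac{1}{2T}\int_{-\infty}^{\infty} X_T(t) e^{-j0t}\, dt = \frac{1}{2T}\,\widetilde{X}_T(\omega)\Big|_{\omega=0} = \frac{\widetilde{X}_T(0)}{2T}.$$

Taking the expectation of the squares yields

$$\mathbb{E}\left[|M_T - m|^2\right] = \frac{\mathbb{E}\left[|\widetilde{X}_T(0)|^2\right]}{4T^2}.$$

Recall from the Einstein-Wiener-Khinchin theorem,

$$\frac{1}{2T}\,\mathbb{E}\left[|\widetilde{X}_T(0)|^2\right] = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_X(v)\, 2T\left(\frac{\sin(vT)}{vT}\right)^2 dv.$$

Taking the limit $T \to \infty$, we have that

$$\lim_{T\to\infty} \frac{1}{2T}\,\mathbb{E}\left[|\widetilde{X}_T(0)|^2\right] = S_X(0).$$

Hence,

$$\lim_{T\to\infty} \mathbb{E}\left[|M_T - m|^2\right] = \lim_{T\to\infty} \frac{1}{2T}\cdot\frac{1}{2T}\,\mathbb{E}\left[|\widetilde{X}_T(0)|^2\right] = \lim_{T\to\infty} \frac{1}{2T}\, S_X(0) = 0.$$

This completes the proof.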
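The decay of the mean-square error can be observed in a simple simulation. The sketch below is a discrete-time illustration under assumed settings rather than the continuous-time statement: the process is a constant mean m plus white Gaussian noise passed through a hypothetical filter h = [1, 0.5], and the temporal average over n samples plays the role of the window average. The mean value, window lengths, and number of trials are chosen here for illustration.

```python
import random

random.seed(0)                        # fixed seed for reproducibility
m = 3.0                               # true mean (hypothetical value)

def temporal_mean(n):
    """Average of n samples of Y[k] = m + X[k+1] + 0.5 X[k], with X white Gaussian."""
    x = [random.gauss(0.0, 1.0) for _ in range(n + 1)]
    return sum(m + x[k + 1] + 0.5 * x[k] for k in range(n)) / n

def mean_square_error(n, trials=300):
    """Monte Carlo estimate of E|M_n - m|^2 over independent realizations."""
    return sum((temporal_mean(n) - m) ** 2 for _ in range(trials)) / trials

mse_short = mean_square_error(10)     # short window
mse_long = mean_square_error(1000)    # long window

print(mse_short, mse_long)
```

For this process the error behaves roughly like a constant divided by the window length, which mirrors the factor 1/(2T) multiplying the PSD at zero frequency in the proof.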
10.10 Problems


Exercise 1. (VIDEO SOLUTION) 
Consider the random process 


X(t) = 2A cos(t) + (B — 1) sin(t), 


where A and B are two independent random variables with $\mathbb{E}[A] = \mathbb{E}[B] = 0$, and $\mathbb{E}[A^2] = \mathbb{E}[B^2] = 1$.


(a) Find $\mu_X(t)$.
(b) Find $R_X(t_1, t_2)$.
(c) Find $C_X(t_1, t_2)$.


Exercise 2. (VIDEO SOLUTION) 


Let X[n] be a discrete-time random process with mean function $m_X[n] = \mathbb{E}\{X[n]\}$ and correlation function $R_X[n,m] = \mathbb{E}\{X[n]X[m]\}$. Suppose that

$$Y[n] = \sum_{i=-\infty}^{\infty} h[n-i]\, X[i]. \tag{10.59}$$

(a) Find $\mu_Y[n]$.
(b) Find $R_{XY}[n,m]$.


Exercise 3. (VIDEO SOLUTION) 
Let Y(t) = X(t) — X(t —d). 


(a) Find $R_{XY}(\tau)$ and $S_{XY}(\omega)$.
(b) Find $R_Y(\tau)$.
(c) Find $S_Y(\omega)$.


Exercise 4. (VIDEO SOLUTION) 
Let X(t) be a zero-mean WSS process with autocorrelation function $R_X(\tau)$. Let $Y(t) = X(t)\cos(\omega t + \Theta)$, where $\Theta \sim \text{uniform}(-\pi, \pi)$ and $\Theta$ is independent of the process X(t).

(a) Find the autocorrelation function $R_Y(\tau)$.
(b) Find the cross-correlation function of X(t) and Y(t).
(c) Is Y(t) WSS? Why or why not?


Exercise 5. (VIDEO SOLUTION) 
A WSS process X(t) with autocorrelation function 


$$R_X(\tau) = 1/(1 + \tau^2)$$
is passed through an LTI system with impulse response
$$h(t) = 3\sin(\pi t)/(\pi t).$$
Let Y(t) be the system output. Find $S_Y(\omega)$ and sketch $S_Y(\omega)$.


Exercise 6. (VIDEO SOLUTION) 
A white noise X(t) with power spectral density $S_X(\omega) = N_0/2$ is applied to a lowpass filter h(t) with impulse response

$$h(t) = \frac{1}{RC}\, e^{-t/RC}, \quad t > 0. \tag{10.60}$$

Find the following.
Exercise 7. (VIDEO SOLUTION)
Consider a WSS process X(t) with autocorrelation function 


$$R_X(\tau) = \text{sinc}(\pi\tau).$$

The process is sent to an LTI system with input-output relationship

$$\frac{d^2}{dt^2}Y(t) + 4\frac{d}{dt}Y(t) + 4Y(t) = 3\frac{d^2}{dt^2}X(t) - 3\frac{d}{dt}X(t) + 6X(t).$$

Find the autocorrelation function $R_Y(\tau)$.
Exercise 8. (VIDEO SOLUTION) 
Given the functions a(t), b(t) and c(t), let 
g(t, 1) = a(t), 
g(t, 2) = b(t), 
g(t, 3) = c(t) 


Let X(t) = g(t, Z), where Z is a discrete random variable with PMF $P[Z = 1] = p_1$, $P[Z = 2] = p_2$ and $P[Z = 3] = p_3$. Find, in terms of $p_1$, $p_2$, $p_3$, a(t), b(t) and c(t),

(a) $\mu_X(t)$.
(b) $R_X(t_1, t_2)$.

Exercise 9. 

In the previous problem, let $a(t) = e^{-|t|}$, $b(t) = \sin(\pi t)$ and $c(t) = -1$.

(a) Choose $p_1$, $p_2$, $p_3$ so that X(t) is WSS.

(b) Choose $p_1$, $p_2$, $p_3$ so that X(t) is not WSS.


Exercise 10. (VIDEO SOLUTION) 


Find the autocorrelation function $R_X(\tau)$ corresponding to each of the following power spectral densities:

(a) $\delta(\omega - \omega_0) + \delta(\omega + \omega_0)$.
(b) $e^{-\omega^2/2}$.
(c) $e^{-|\omega|}$.
Exercise 11. (VIDEO SOLUTION)

A WSS process X(t) with autocorrelation function $R_X(\tau) = e^{-\tau^2/(2\sigma_T^2)}$ is passed through an LTI system with transfer function $H(\omega) = e^{-\omega^2/(2\sigma_H^2)}$. Denote the system output by Y(t). Find the following.


Exercise 12. (VIDEO SOLUTION) 
A white noise X(t) with power spectral density $S_X(\omega) = N_0/2$ is applied to a lowpass filter h(t) with

$$H(\omega) = \begin{cases} 1 - \omega^2, & \text{if } |\omega| \le \pi, \\ 0, & \text{otherwise.} \end{cases}$$

Find $\mathbb{E}[|Y(t)|^2]$, where Y(t) is the output of the filter.


Exercise 13. (VIDEO SOLUTION) 
Let X(t) be a WSS process with correlation function 


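$$R_X(\tau) = \begin{cases} 1 - |\tau|, & \text{if } -1 < \tau < 1, \\ 0, & \text{otherwise.} \end{cases} \tag{10.61}$$

It is known that when X(t) is input to a system with transfer function $H(\omega)$, the system output Y(t) has a correlation function

$$R_Y(\tau) = \frac{\sin \pi\tau}{\pi\tau}. \tag{10.62}$$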


Find the transfer function H(w). 


Exercise 14. 
Consider the system 


Assume that X(t) is zero-mean white noise with power spectral density $S_X(\omega) = N_0/2$. Find the following:

(a) $S_{XY}(\omega)$.
(b) $R_{XY}(\tau)$.
(c) $S_Y(\omega)$.
(d) $R_Y(\tau)$.
Appendix


Useful Identities 


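1. $\sum_{k=0}^{\infty} r^k = 1 + r + r^2 + \cdots = \frac{1}{1-r}$

2. $\sum_{k=1}^{n} k = 1 + 2 + \cdots + n = \frac{n(n+1)}{2}$

3. $e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$

4. $\sum_{k=1}^{\infty} k r^{k-1} = 1 + 2r + 3r^2 + \cdots = \frac{1}{(1-r)^2}$

5. $\sum_{k=1}^{n} k^2 = 1^2 + 2^2 + \cdots + n^2 = \frac{n^3}{3} + \frac{n^2}{2} + \frac{n}{6}$

6. $(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k$
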
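Common Distributions

| Distribution | PMF / PDF | $\mathbb{E}[X]$ | $\text{Var}[X]$ | $M_X(s)$ |
|---|---|---|---|---|
| Bernoulli | $p_X(1) = p$ and $p_X(0) = 1-p$ | $p$ | $p(1-p)$ | $1 - p + pe^s$ |
| Binomial | $p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}$ | $np$ | $np(1-p)$ | $(1 - p + pe^s)^n$ |
| Geometric | $p_X(k) = p(1-p)^{k-1}$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | $\frac{pe^s}{1 - (1-p)e^s}$ |
| Poisson | $p_X(k) = \frac{\lambda^k e^{-\lambda}}{k!}$ | $\lambda$ | $\lambda$ | $e^{\lambda(e^s - 1)}$ |
| Gaussian | $f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$ | $\mu$ | $\sigma^2$ | $\exp\left\{\mu s + \frac{\sigma^2 s^2}{2}\right\}$ |
| Exponential | $f_X(x) = \lambda \exp\{-\lambda x\}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ | $\frac{\lambda}{\lambda - s}$ |
| Uniform | $f_X(x) = \frac{1}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $\frac{e^{sb} - e^{sa}}{s(b-a)}$ |
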
Sum of Two Random Variables


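| $X_1$ | $X_2$ | Sum $X_1 + X_2$ |
|---|---|---|
| Bernoulli($p$) | Bernoulli($p$) | Binomial($2, p$) |
| Binomial($n, p$) | Binomial($m, p$) | Binomial($m + n, p$) |
| Poisson($\lambda_1$) | Poisson($\lambda_2$) | Poisson($\lambda_1 + \lambda_2$) |
| Exponential($\lambda$) | Exponential($\lambda$) | Erlang($2, \lambda$) |
| Gaussian($\mu_1, \sigma_1^2$) | Gaussian($\mu_2, \sigma_2^2$) | Gaussian($\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2$) |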
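One row of this table can be checked by simulation. The sketch below is an illustration with an assumed rate and sample size: it adds two independent exponential random variables with the same rate and compares the sample mean and variance of the sum with those of an Erlang(2, lambda) random variable, namely 2/lambda and 2/lambda^2.

```python
import random
import statistics

random.seed(0)                        # fixed seed for reproducibility
lam, n = 2.0, 100_000                 # rate and sample size (illustrative)

# Sum of two independent Exponential(lam) random variables.
s = [random.expovariate(lam) + random.expovariate(lam) for _ in range(n)]

mean_hat = statistics.fmean(s)
var_hat = statistics.pvariance(s)

# Erlang(2, lam) has mean 2/lam and variance 2/lam^2.
print(mean_hat, 2 / lam)
print(var_hat, 2 / lam**2)
```

Matching the first two moments does not prove the distributional claim, but it is a quick sanity check that the sum is not simply another exponential random variable.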


Fourier Transform Table 


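$$F(\omega) = \int_{-\infty}^{\infty} f(t) e^{-j\omega t}\, dt.$$

| No. | $f(t)$ | $F(\omega)$ |
|---|---|---|
| 1 | $e^{-at}u(t)$ | $\frac{1}{a + j\omega}$, $a > 0$ |
| 2 | $e^{at}u(-t)$ | $\frac{1}{a - j\omega}$, $a > 0$ |
| 3 | $e^{-a\lvert t\rvert}$ | $\frac{2a}{a^2 + \omega^2}$, $a > 0$ |
| 4 | $\frac{a^2}{a^2 + t^2}$ | $\pi a e^{-a\lvert\omega\rvert}$, $a > 0$ |
| 5 | $te^{-at}u(t)$ | $\frac{1}{(a + j\omega)^2}$, $a > 0$ |
| 6 | $t^n e^{-at}u(t)$ | $\frac{n!}{(a + j\omega)^{n+1}}$, $a > 0$ |
| 7 | $\text{rect}\left(\frac{t}{T}\right)$ | $T\,\text{sinc}\left(\frac{\omega T}{2}\right)$ |
| 8 | $\text{sinc}(Wt)$ | $\frac{\pi}{W}\,\text{rect}\left(\frac{\omega}{2W}\right)$ |
| 9 | $\Lambda\left(\frac{t}{T}\right)$ | $\frac{T}{2}\,\text{sinc}^2\left(\frac{\omega T}{4}\right)$ |
| 10 | $\text{sinc}^2\left(\frac{Wt}{2}\right)$ | $\frac{2\pi}{W}\,\Lambda\left(\frac{\omega}{2W}\right)$ |
| 11 | $e^{-at}\sin(\omega_0 t)u(t)$ | $\frac{\omega_0}{(a + j\omega)^2 + \omega_0^2}$, $a > 0$ |
| 12 | $e^{-at}\cos(\omega_0 t)u(t)$ | $\frac{a + j\omega}{(a + j\omega)^2 + \omega_0^2}$, $a > 0$ |
| 13 | $\exp\left\{-\frac{t^2}{2\sigma^2}\right\}$ | $\sqrt{2\pi}\,\sigma \exp\left\{-\frac{\sigma^2\omega^2}{2}\right\}$ |
| 14 | $\delta(t)$ | $1$ |
| 15 | $1$ | $2\pi\delta(\omega)$ |
| 16 | $\delta(t - t_0)$ | $e^{-j\omega t_0}$ |
| 17 | $e^{j\omega_0 t}$ | $2\pi\delta(\omega - \omega_0)$ |
| 18 | $f(t)e^{j\omega_0 t}$ | $F(\omega - \omega_0)$ |

The functions used in the table are defined as

$$\text{sinc}(t) = \frac{\sin(t)}{t},$$

$$\text{rect}(t) = \begin{cases} 1, & -0.5 \le t \le 0.5, \\ 0, & \text{otherwise,} \end{cases}$$

$$\Lambda(t) = \begin{cases} 1 - 2|t|, & -0.5 \le t \le 0.5, \\ 0, & \text{otherwise.} \end{cases}$$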


Basic Trigonometric Identities 


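$e^{j\theta} = \cos\theta + j\sin\theta$

$\sin 2\theta = 2\sin\theta\cos\theta$

$\cos 2\theta = 2\cos^2\theta - 1$

$\cos A\cos B = \frac{1}{2}\left(\cos(A+B) + \cos(A-B)\right)$

$\sin A\sin B = -\frac{1}{2}\left(\cos(A+B) - \cos(A-B)\right)$

$\sin A\cos B = \frac{1}{2}\left(\sin(A+B) + \sin(A-B)\right)$

$\cos A\sin B = \frac{1}{2}\left(\sin(A+B) - \sin(A-B)\right)$

$\cos(A+B) = \cos A\cos B - \sin A\sin B$

$\cos(A-B) = \cos A\cos B + \sin A\sin B$

$\sin(A+B) = \sin A\cos B + \cos A\sin B$

$\sin(A-B) = \sin A\cos B - \cos A\sin B$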